## Hello!
If you already know what ```pandas``` is, can open files, and are comfortable with its basic assessment and selection methods, but want to learn how to clean or manipulate your data, you've come to the right place! This tutorial covers the most important cleaning functions and methods, as well as data manipulation with ```pandas```.

*If you're unsure about how to open files and the basics, you should check the <a href="https://github.com/lona9/PythonTutorials/blob/master/Basic%20pandas.ipynb">Basic pandas Tutorial</a> first!<br>
If you're unsure about assessment with pandas, you should check the <a href="https://github.com/lona9/PythonTutorials/blob/master/Assessing%20with%20pandas.ipynb">Assessing with pandas Tutorial</a> first!*

## Menu
- <a href="#cleaning">Cleaning methods</a>
- <a href="#value">Value manipulation</a>
- <a href="#apply">Apply for functions</a>

<a id="cleaning"></a>
## Cleaning methods
As we've already seen in the assessment tutorial, our data isn't entirely clean: some rows have missing values and some are duplicated. We're going to see what we can do about those with some helpful cleaning methods.<br>
Before doing any cleaning, it's best to work on a copy of the DataFrame instead of editing the original, so we don't lose its original state. We can use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html">```copy```</a> to create this new DataFrame, which we can freely clean.

In [27]:
df_copy = df.copy()
df_copy.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,0.0,0.0,0.0,0.0,0.0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0.0,0.0,0.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No
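To see why the copy matters, here is a minimal sketch on a tiny made-up DataFrame (the `toy_df` name and its values are only for illustration, they are not part of our dataset). It assumes the default behaviour of ```copy()```, which is a deep copy, so editing the copy should leave the original untouched.

```python
import pandas as pd

# Tiny illustrative DataFrame (hypothetical values, not the real dataset)
toy_df = pd.DataFrame({"appointment_id": [1, 2, 3], "age": [62, 56, 8]})

# copy() makes a deep copy by default, so the two DataFrames are independent
toy_copy = toy_df.copy()

# Edit the copy only
toy_copy.loc[0, "age"] = 99

print(toy_df.loc[0, "age"])    # value in the original
print(toy_copy.loc[0, "age"])  # value in the copy
```

Plain assignment such as `toy_copy = toy_df` would not create a new DataFrame, it would only give the same object a second name.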


So now we can come back and deal with some rows. Let's start with duplicates: we don't need any duplicated data, and we could remove these rows with <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html">```drop```</a>, but first we have to find them. We can recognize duplicates in this DataFrame using the ```appointment_id``` column, as it should act as a primary key. We're going to check how many duplicated values we have in that column, using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html">```duplicated()```</a>.

In [28]:
df_copy.appointment_id.duplicated().sum()

10

We don't want these duplicated rows, so we can remove them easily using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html">```drop_duplicates()```</a>, which can remove duplicated values based on all data, or we can specify the column like the following example:

In [29]:
df_copy.drop_duplicates(subset=["appointment_id"])

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,2.987250e+13,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,No
1,5.589978e+14,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,0.0,0.0,0.0,0.0,0.0,No
2,4.262962e+12,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0.0,0.0,0.0,0.0,0.0,0.0,No
3,8.679512e+11,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
4,8.841186e+12,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110532,2.572134e+12,5651768,F,2016-05-03T09:15:35Z,2016-06-07T00:00:00Z,56,MARIA ORTIZ,0.0,0.0,0.0,0.0,0.0,1.0,No
110533,3.596266e+12,5650093,F,2016-05-03T07:27:33Z,2016-06-07T00:00:00Z,51,MARIA ORTIZ,0.0,0.0,0.0,0.0,0.0,1.0,No
110534,1.557663e+13,5630692,F,2016-04-27T16:03:52Z,2016-06-07T00:00:00Z,21,MARIA ORTIZ,0.0,0.0,0.0,0.0,0.0,1.0,No
110535,9.213493e+13,5630323,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0.0,0.0,0.0,0.0,0.0,1.0,No


In [30]:
df_copy.appointment_id.duplicated().sum()

10

This method returns a new DataFrame with the duplicated rows removed, but it doesn't actually remove them from the original DataFrame, as you can see in the previous cells. To do this, we must set the optional argument ```inplace``` to **True**.

In [31]:
df_copy.drop_duplicates(subset=["appointment_id"], inplace=True)

In [32]:
df_copy.appointment_id.duplicated().sum()

0
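As an alternative to ```inplace=True```, we can also assign the returned DataFrame to a variable. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical) and the ```keep``` argument of ```drop_duplicates()```, which decides which of the repeated rows survives.

```python
import pandas as pd

# Hypothetical data: appointment_id 101 appears twice
toy = pd.DataFrame({
    "appointment_id": [100, 101, 101, 102],
    "age": [30, 41, 41, 55],
})

# keep="first" (the default) keeps the first occurrence of each id;
# keep="last" would keep the last one, keep=False would drop all repeats
deduped = toy.drop_duplicates(subset=["appointment_id"], keep="first")

# Count the duplicated ids that remain after cleaning
n_left = deduped["appointment_id"].duplicated().sum()
print(n_left)
print(len(toy), len(deduped))  # the original DataFrame is unchanged
```

Reassigning keeps the original around for comparison, which can be handy while we're still deciding how to clean.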

Now it's done! We can also use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html">```drop()```</a> to remove specific rows or columns based on their labels. ```drop()``` has an ```axis``` argument set to 0 by default, which refers to rows, so the values we pass are looked up in the index. You can change it to ```axis = 1```, or use the ```columns``` argument, to remove columns by name as well.

In [33]:
# Creating a new df to show examples of row and column removal
df2 = df_copy.copy()

# Removing rows 1 and 2 
df2.drop([1,2], inplace=True)
df2.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No
5,95985130000000.0,5626772,F,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,76,REPÚBLICA,0.0,1.0,0.0,0.0,0.0,0.0,No
6,733688200000000.0,5630279,F,2016-04-27T15:05:12Z,2016-04-29T00:00:00Z,23,GOIABEIRAS,0.0,0.0,0.0,0.0,0.0,0.0,Yes


In [34]:
df2.drop("sms_received", axis=1, inplace=True)
df2.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,no_show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,No
5,95985130000000.0,5626772,F,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,76,REPÚBLICA,0.0,1.0,0.0,0.0,0.0,No
6,733688200000000.0,5630279,F,2016-04-27T15:05:12Z,2016-04-29T00:00:00Z,23,GOIABEIRAS,0.0,0.0,0.0,0.0,0.0,Yes


In [35]:
df2.drop(columns=["age", "gender"], inplace=True)
df2.head()

Unnamed: 0,patient_id,appointment_id,scheduled_date,appointment_date,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,no_show
0,29872500000000.0,5642903,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,No
5,95985130000000.0,5626772,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,REPÚBLICA,0.0,1.0,0.0,0.0,0.0,No
6,733688200000000.0,5630279,2016-04-27T15:05:12Z,2016-04-29T00:00:00Z,GOIABEIRAS,0.0,0.0,0.0,0.0,0.0,Yes
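Notice that ```drop()``` works with labels: after removing rows 1 and 2, the index jumps from 0 to 3. If we want to remove by position instead, one option is to look up the labels first through ```index``` or ```columns```. This is a small sketch on made-up data (`toy` is hypothetical), not the only way to do it.

```python
import pandas as pd

toy = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

# Drop the row labelled 1; the remaining index labels are 0, 2, 3
toy = toy.drop([1])

# Position 1 now holds the row labelled 2, so we translate position -> label
by_position = toy.drop(toy.index[[1]])

# Same idea for columns: position 1 is the column named "b"
without_second_col = toy.drop(columns=toy.columns[[1]])

print(list(by_position.index))
print(list(without_second_col.columns))
```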


We also know we have some null values. We can do several things when we find these: we can leave them as ```NaN``` and just move on with life, we can drop the rows using the method we just learned, we can replace the values with the average or a default value if there is one, or we can replace them with a non-null value that still shows there's no data for that row.
For removal, besides ```drop```, we also have <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html">```dropna```</a>, which makes it all faster and easier, because we don't have to manually identify which rows have null values to remove them.

In [36]:
df_copy.isnull().sum()

patient_id          0
appointment_id      0
gender              4
scheduled_date      0
appointment_date    0
age                 0
neighbourhood       0
scholarship         2
hipertension        2
diabetes            2
alcoholism          4
handcap             3
sms_received        2
no_show             0
dtype: int64

In [37]:
df_copy.dropna(inplace=True)
df_copy.isnull().sum()

patient_id          0
appointment_id      0
gender              0
scheduled_date      0
appointment_date    0
age                 0
neighbourhood       0
scholarship         0
hipertension        0
diabetes            0
alcoholism          0
handcap             0
sms_received        0
no_show             0
dtype: int64
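By default ```dropna()``` removes every row that has at least one null value. If that is too aggressive, the ```subset``` and ```how``` arguments let us narrow it down. Here's a minimal sketch on made-up data (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gender": ["F", None, "M", "F"],
    "alcoholism": [0.0, 1.0, np.nan, 0.0],
    "handcap": [0.0, 0.0, 0.0, np.nan],
})

# Default: drop rows with a null in any column
any_null = toy.dropna()

# Only look at the gender column when deciding what to drop
only_gender = toy.dropna(subset=["gender"])

# Drop a row only if every value in it is null
all_null = toy.dropna(how="all")

print(len(any_null), len(only_gender), len(all_null))
```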

We could also use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html">```fillna()```</a> to replace the null values with something else, like a 0, the average, the median, or a 'None' ```str```, whatever we decide is best for our data.
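As a quick sketch of what that could look like, ```fillna()``` accepts a dictionary that maps each column to its own replacement value. The data and the chosen replacements below are made up for illustration; which value is appropriate depends on your data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gender": ["F", None, "M"],
    "scholarship": [0.0, np.nan, 1.0],
    "age": [20.0, np.nan, 40.0],
})

# One replacement per column: a label, a default value, and the column average
filled = toy.fillna({
    "gender": "Unknown",
    "scholarship": 0,
    "age": toy["age"].mean(),
})

print(filled)
print(filled.isnull().sum().sum())  # total nulls left
```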

Last but not least, we mentioned it's important to set datatypes correctly for better processing of our data. We can use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html?highlight=astype">```astype()```</a> to change the datatype of our columns. For example, our ```patient_id``` column uses numbers for the ID, but since it's a really big number and it's being processed as a ```float```, we can't see the full number and we're looking at the IDs in scientific notation instead. We can fix this by converting the datatype of this column to ```int```, which shows the full number without the notation.

In [38]:
df_copy.patient_id = df_copy.patient_id.astype(int)
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110510 entries, 0 to 110536
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   patient_id        110510 non-null  int64  
 1   appointment_id    110510 non-null  int64  
 2   gender            110510 non-null  object 
 3   scheduled_date    110510 non-null  object 
 4   appointment_date  110510 non-null  object 
 5   age               110510 non-null  int64  
 6   neighbourhood     110510 non-null  object 
 7   scholarship       110510 non-null  float64
 8   hipertension      110510 non-null  float64
 9   diabetes          110510 non-null  float64
 10  alcoholism        110510 non-null  float64
 11  handcap           110510 non-null  float64
 12  sms_received      110510 non-null  float64
 13  no_show           110510 non-null  object 
dtypes: float64(6), int64(3), object(5)
memory usage: 12.6+ MB
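The conversion above works because we had already dropped the null values. ```NaN``` can't be stored in a regular integer column, so ```astype(int)``` raises an error if any are left. The sketch below shows this on a small Series that reuses two IDs from our data plus one made-up missing value, and tries the nullable ```"Int64"``` dtype as one possible alternative.

```python
import numpy as np
import pandas as pd

# Two real-looking IDs plus one missing value (the NaN is added for illustration)
ids = pd.Series([29872500000000.0, 558997800000000.0, np.nan], name="patient_id")

# Casting to a plain integer fails while NaN is present
try:
    ids.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Option 1: drop the nulls first, then cast (explicit "int64" avoids
# depending on the platform's default integer size)
as_int = ids.dropna().astype("int64")

# Option 2: pandas' nullable integer dtype keeps the missing value as <NA>
as_nullable = ids.astype("Int64")

print(cast_failed)
print(as_int.tolist())
print(as_nullable)
```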


We also have two date columns with an ```object``` datatype, which is what ```pandas``` uses for ```str``` values. If we wanted to make use of ```datetime``` functions and methods, we wouldn't be able to do so with the values as strings, so we can convert them to datetime objects with the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html">```to_datetime()```</a> function included in ```pandas```.

In [39]:
df_copy.scheduled_date = pd.to_datetime(df_copy.scheduled_date)
df_copy.appointment_date = pd.to_datetime(df_copy.appointment_date)
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110510 entries, 0 to 110536
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype              
---  ------            --------------   -----              
 0   patient_id        110510 non-null  int64              
 1   appointment_id    110510 non-null  int64              
 2   gender            110510 non-null  object             
 3   scheduled_date    110510 non-null  datetime64[ns, UTC]
 4   appointment_date  110510 non-null  datetime64[ns, UTC]
 5   age               110510 non-null  int64              
 6   neighbourhood     110510 non-null  object             
 7   scholarship       110510 non-null  float64            
 8   hipertension      110510 non-null  float64            
 9   diabetes          110510 non-null  float64            
 10  alcoholism        110510 non-null  float64            
 11  handcap           110510 non-null  float64            
 12  sms_received      110510 non-null  float64  
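Once the columns are datetimes, the ```.dt``` accessor gives us access to date-specific attributes and methods. As a small sketch, using two date pairs taken from the rows shown earlier (the `weekday` and `wait_days` column names are just made up for this example):

```python
import pandas as pd

toy = pd.DataFrame({
    "scheduled_date": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"],
    "appointment_date": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})

# Convert the strings to timezone-aware datetimes
toy["scheduled_date"] = pd.to_datetime(toy["scheduled_date"])
toy["appointment_date"] = pd.to_datetime(toy["appointment_date"])

# Day of the week of each appointment
toy["weekday"] = toy["appointment_date"].dt.day_name()

# Whole days between scheduling and appointment; normalize() sets the
# time to midnight so only the calendar dates are compared
toy["wait_days"] = (
    toy["appointment_date"].dt.normalize() - toy["scheduled_date"].dt.normalize()
).dt.days

print(toy[["weekday", "wait_days"]])
```

None of this would work on the original ```object``` columns, since strings have no ```.dt``` accessor.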

<a id="value"></a>
## Value manipulation

split, join, replace, string manipulation

<a id="apply"></a>
## Apply for functions
