## Hello!
You already know what ```pandas``` is, can open files, can assess and select data with the package, and want to learn how to clean or manipulate your data? You've come to the right place! This tutorial covers the most important cleaning functions and methods, as well as data manipulation with ```pandas```.

*If you're unsure about how to open files and the basics, you should check the <a href="https://github.com/lona9/PythonTutorials/blob/master/Basic%20pandas.ipynb">Basic pandas Tutorial</a> first!<br>
If you're unsure about assessment with pandas, you should check the <a href="https://github.com/lona9/PythonTutorials/blob/master/Assessing%20with%20pandas.ipynb">Assessing with pandas Tutorial</a> first!*

## Menu
- <a href="#drop">Dropping values</a>
- <a href="#datatypes">Changing datatypes</a>
- <a href="#replace">replace method</a>
- <a href="#apply">apply method</a>
- <a href="#insert">insert method</a>
- <a href="#more">More pandas</a>

As we've already seen in the assessment tutorial, we have some not so clean rows, with missing data or duplicated rows. We're going to see what we can do about those with some helpful cleaning methods.<br>
Before doing any cleaning, it's best to work on a copy of the DataFrame instead of editing the original, so we don't lose its original state. We can use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html">```copy```</a> to create this new DataFrame, which we can then freely clean.

In [1]:
import pandas as pd

In [2]:
column_names = ["patient_id", "appointment_id", "gender", "scheduled_date", "appointment_date", "age", "neighbourhood", "scholarship", "hipertension", "diabetes", "alcoholism", "handcap", "sms_received", "no_show"]

df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv", header=0, names=column_names)

In [3]:
df_copy = df.copy()
df_copy.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,0.0,0.0,0.0,0.0,0.0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0.0,0.0,0.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No
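
To see why the copy matters, here is a minimal sketch on a made-up DataFrame (the names `original` and `working` and their values are only for illustration, not part of the dataset). It assumes the default behaviour of ```copy```, which duplicates the data instead of sharing it:

```python
import pandas as pd

# Toy stand-in for the appointments data (values are made up)
original = pd.DataFrame({"age": [62, 56, 8], "no_show": ["No", "No", "Yes"]})

# copy() defaults to deep=True, so the data is duplicated, not shared
working = original.copy()
working.loc[0, "age"] = 0

print(original.loc[0, "age"])  # the original value is untouched
print(working.loc[0, "age"])   # only the copy changed
```

Any cleaning done on `working` leaves `original` available for comparison.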


<a id="drop"></a>
## Dropping values

So now we can deal with some rows. For duplicates, we can <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html">```drop```</a> these rows, as we don't need any duplicated data. We can recognize duplicates in this DataFrame using the ```appointment_id``` column, as it should act as a primary key. First, we're going to check how many duplicated values we have in that column, using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html">```duplicated()```</a>:

In [4]:
df_copy.appointment_id.duplicated().sum()

10

We don't want these duplicated rows, so we can remove them easily using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html">```drop_duplicates()```</a>, which can remove duplicates based on all columns, or only on the columns we specify, like in the following example:

In [5]:
df_copy.drop_duplicates(subset=["appointment_id"])

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,2.987250e+13,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,No
1,5.589978e+14,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,0.0,0.0,0.0,0.0,0.0,No
2,4.262962e+12,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0.0,0.0,0.0,0.0,0.0,0.0,No
3,8.679512e+11,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
4,8.841186e+12,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110532,2.572134e+12,5651768,F,2016-05-03T09:15:35Z,2016-06-07T00:00:00Z,56,MARIA ORTIZ,0.0,0.0,0.0,0.0,0.0,1.0,No
110533,3.596266e+12,5650093,F,2016-05-03T07:27:33Z,2016-06-07T00:00:00Z,51,MARIA ORTIZ,0.0,0.0,0.0,0.0,0.0,1.0,No
110534,1.557663e+13,5630692,F,2016-04-27T16:03:52Z,2016-06-07T00:00:00Z,21,MARIA ORTIZ,0.0,0.0,0.0,0.0,0.0,1.0,No
110535,9.213493e+13,5630323,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0.0,0.0,0.0,0.0,0.0,1.0,No


In [6]:
df_copy.appointment_id.duplicated().sum()

10

Wait, the duplicates are still there? That's because this method returns a new DataFrame with the duplicated rows removed, but it doesn't modify the original DataFrame, as you can see in the previous cell. To change the DataFrame itself, we must set the optional argument ```inplace``` (```False``` by default) to **True**.

In [7]:
df_copy.drop_duplicates(subset=["appointment_id"], inplace=True)

In [8]:
df_copy.appointment_id.duplicated().sum()

0
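
As an alternative to ```inplace```, the returned DataFrame can simply be assigned to a name. The sketch below uses a small made-up DataFrame (`visits` and its values are hypothetical) and also shows the ```keep``` argument, which chooses which of the duplicated rows survives:

```python
import pandas as pd

# Made-up data: appointment 102 appears twice
visits = pd.DataFrame({
    "appointment_id": [101, 102, 102, 103],
    "age": [62, 56, 57, 8],
})

# Without inplace=True, the result must be assigned to be kept
deduped = visits.drop_duplicates(subset=["appointment_id"])

# keep="last" retains the final occurrence instead of the first
deduped_last = visits.drop_duplicates(subset=["appointment_id"], keep="last")

print(deduped["age"].tolist())       # [62, 56, 8]
print(deduped_last["age"].tolist())  # [62, 57, 8]
```

Which occurrence to keep depends on the data; here we have no reason to prefer one, so the default is fine.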

Now it's done! We can also use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html">```drop()```</a> to remove specific rows or columns by their labels. ```drop()``` has an ```axis``` argument that defaults to 0, which refers to rows (index labels). You can change it to ```axis = 1```, or use the ```columns``` argument, to remove columns by label instead.

In [9]:
# Creating a new df to show examples of row and column removal
df2 = df_copy.copy()

# Removing the rows labelled 1 and 2
df2.drop([1,2], inplace=True)
df2.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No
5,95985130000000.0,5626772,F,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,76,REPÚBLICA,0.0,1.0,0.0,0.0,0.0,0.0,No
6,733688200000000.0,5630279,F,2016-04-27T15:05:12Z,2016-04-29T00:00:00Z,23,GOIABEIRAS,0.0,0.0,0.0,0.0,0.0,0.0,Yes


In [10]:
df2.drop("sms_received", axis=1, inplace=True)
df2.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,no_show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,No
5,95985130000000.0,5626772,F,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,76,REPÚBLICA,0.0,1.0,0.0,0.0,0.0,No
6,733688200000000.0,5630279,F,2016-04-27T15:05:12Z,2016-04-29T00:00:00Z,23,GOIABEIRAS,0.0,0.0,0.0,0.0,0.0,Yes


In [11]:
df2.drop(columns=["age", "gender"], inplace=True)
df2.head()

Unnamed: 0,patient_id,appointment_id,scheduled_date,appointment_date,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,no_show
0,29872500000000.0,5642903,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,No
5,95985130000000.0,5626772,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,REPÚBLICA,0.0,1.0,0.0,0.0,0.0,No
6,733688200000000.0,5630279,2016-04-27T15:05:12Z,2016-04-29T00:00:00Z,GOIABEIRAS,0.0,0.0,0.0,0.0,0.0,Yes
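
Since ```drop()``` matches labels, dropping by position takes one extra step: look up the labels at those positions first. A small sketch, using a made-up DataFrame whose index labels (10, 20, 30) deliberately differ from the positions:

```python
import pandas as pd

frame = pd.DataFrame(
    {"age": [62, 56, 8], "gender": ["F", "M", "F"]},
    index=[10, 20, 30],
)

# drop() matches index labels: this removes the row labelled 20
by_label = frame.drop(20)

# To drop by position, fetch the labels at those positions first
by_position = frame.drop(frame.index[[0]])

# The same idea works for columns
without_first_col = frame.drop(columns=frame.columns[[0]])

print(by_label.index.tolist())            # [10, 30]
print(by_position.index.tolist())         # [20, 30]
print(without_first_col.columns.tolist()) # ['gender']
```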


We also know we have some null values, and there are several things we can do when we find these: leave them as ```NaN``` and just move on with life; drop the rows, using the method we just learned; replace them with the average or a default value, if there is one; or replace them with a non-null value that still shows there's no data for that row.
To remove null values, besides ```drop```, we also have <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html">```dropna```</a>, which makes it all faster and easier, because we don't have to manually identify which rows contain nulls to remove them.

In [12]:
df_copy.isnull().sum()

patient_id          0
appointment_id      0
gender              4
scheduled_date      0
appointment_date    0
age                 0
neighbourhood       0
scholarship         2
hipertension        2
diabetes            2
alcoholism          4
handcap             3
sms_received        2
no_show             0
dtype: int64

In [13]:
df_copy.dropna(inplace=True)
df_copy.isnull().sum()

patient_id          0
appointment_id      0
gender              0
scheduled_date      0
appointment_date    0
age                 0
neighbourhood       0
scholarship         0
hipertension        0
diabetes            0
alcoholism          0
handcap             0
sms_received        0
no_show             0
dtype: int64

We could also use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html">```fillna()```</a> to replace the null values with something else, like a 0, the average, the median, or a 'None' ```str```: whatever we decide is best for our data.
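
Here is a minimal sketch of that idea on a made-up DataFrame (`patients` and the fill values are assumptions chosen for illustration). Passing a dict lets each column get its own replacement:

```python
import numpy as np
import pandas as pd

patients = pd.DataFrame({
    "gender": ["F", None, "M"],
    "age": [62.0, np.nan, 8.0],
})

# A dict maps each column name to its own fill value
filled = patients.fillna({
    "gender": "Unknown",
    "age": patients["age"].median(),  # median ignores the missing value
})

print(filled["gender"].tolist())  # ['F', 'Unknown', 'M']
print(filled["age"].tolist())     # [62.0, 35.0, 8.0]
```

Like ```drop_duplicates()```, ```fillna()``` returns a new DataFrame unless ```inplace=True``` is passed.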

<a id="datatypes"></a>
## Changing datatypes

It's always important to set datatypes correctly for better processing of our data. We can use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html?highlight=astype">```astype()```</a> to change the datatype of our columns. For example, our ```patient_id``` column uses numbers for the ID, but since it's a really big number and it's being processed as a ```float```, we can't see the full number and we're looking at the IDs in scientific notation instead. We can fix this by converting the datatype of this column to ```int```, which displays the full number without scientific notation.

In [14]:
df_copy.patient_id = df_copy.patient_id.astype(int)
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110510 entries, 0 to 110536
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   patient_id        110510 non-null  int64  
 1   appointment_id    110510 non-null  int64  
 2   gender            110510 non-null  object 
 3   scheduled_date    110510 non-null  object 
 4   appointment_date  110510 non-null  object 
 5   age               110510 non-null  int64  
 6   neighbourhood     110510 non-null  object 
 7   scholarship       110510 non-null  float64
 8   hipertension      110510 non-null  float64
 9   diabetes          110510 non-null  float64
 10  alcoholism        110510 non-null  float64
 11  handcap           110510 non-null  float64
 12  sms_received      110510 non-null  float64
 13  no_show           110510 non-null  object 
dtypes: float64(6), int64(3), object(5)
memory usage: 12.6+ MB
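
One detail worth keeping in mind: a plain integer column cannot hold missing values, which is one reason to deal with nulls before converting. The sketch below uses made-up Series (`ids` and `with_missing` are hypothetical names) and shows the nullable ```"Int64"``` dtype as one possible option when missing values have to stay:

```python
import numpy as np
import pandas as pd

# Large IDs stored as floats, as happens when the column is read as float
ids = pd.Series([29872499824296.0, 558997776694438.0], name="patient_id")
as_int = ids.astype("int64")
print(as_int.tolist())  # [29872499824296, 558997776694438]

# With a missing value, astype("int64") would raise an error;
# the nullable "Int64" dtype (capital I) keeps the gap as <NA>
with_missing = pd.Series([1.0, np.nan])
nullable = with_missing.astype("Int64")
print(nullable.isna().sum())  # 1
```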


We also have two date columns, both with an ```object``` datatype, which is what ```pandas``` uses for ```str``` values. If we wanted to make use of ```datetime``` functions and methods, we wouldn't be able to do so with the values as strings, so we should convert them to datetime objects with the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html">```to_datetime()```</a> function included in ```pandas```.

In [15]:
df_copy.scheduled_date = pd.to_datetime(df_copy.scheduled_date)
df_copy.appointment_date = pd.to_datetime(df_copy.appointment_date)
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110510 entries, 0 to 110536
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype              
---  ------            --------------   -----              
 0   patient_id        110510 non-null  int64              
 1   appointment_id    110510 non-null  int64              
 2   gender            110510 non-null  object             
 3   scheduled_date    110510 non-null  datetime64[ns, UTC]
 4   appointment_date  110510 non-null  datetime64[ns, UTC]
 5   age               110510 non-null  int64              
 6   neighbourhood     110510 non-null  object             
 7   scholarship       110510 non-null  float64            
 8   hipertension      110510 non-null  float64            
 9   diabetes          110510 non-null  float64            
 10  alcoholism        110510 non-null  float64            
 11  handcap           110510 non-null  float64            
 12  sms_received      110510 non-null  float64  
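
Once the columns are datetimes, the ```.dt``` accessor and date arithmetic become available. A short sketch on two made-up rows in the same timestamp format (the names `appts`, `weekday` and `wait_days` are only for illustration):

```python
import pandas as pd

appts = pd.DataFrame({
    "scheduled_date": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"],
    "appointment_date": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})
for col in ["scheduled_date", "appointment_date"]:
    appts[col] = pd.to_datetime(appts[col])

# .dt exposes datetime attributes and methods for the whole column
weekday = appts["appointment_date"].dt.day_name()

# Subtracting two datetime columns gives a timedelta column;
# normalize() drops the time of day so only whole days are compared
wait_days = (
    appts["appointment_date"].dt.normalize()
    - appts["scheduled_date"].dt.normalize()
).dt.days

print(weekday.tolist())    # ['Friday', 'Friday']
print(wait_days.tolist())  # [0, 2]
```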

As we clean, we might want to change our values into something more useful for our analysis, or we might want to create new columns based on the values we already have. There are a lot of methods for value manipulation, and we'll take a look at three of the most helpful and easiest ones to start with.

<a id="replace"></a>
## replace method

The simplest thing we might want to do is to replace the values we have, to make it easier down the road to analyze or to operate on the DataFrame. Let's take the ```no_show``` column for example. Right now, it holds strings that say ```'Yes'``` or ```'No'``` depending on whether the person showed up for the appointment or not. This is a bit counterintuitive, as the column name already contains a negation. A boolean would also be more appropriate for this column, so we might want to replace these Yes/No values to get a boolean column instead. The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html">```replace```</a> method is what we need to do this. Its main arguments are ```to_replace``` and ```value```, which can both take numbers, strings or even lists of values, plus the ```inplace``` argument to apply these changes directly to the object the method is called on.

In [16]:
df_copy.no_show.replace(to_replace=["Yes", "No"], value = [True, False], inplace=True)
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110510 entries, 0 to 110536
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype              
---  ------            --------------   -----              
 0   patient_id        110510 non-null  int64              
 1   appointment_id    110510 non-null  int64              
 2   gender            110510 non-null  object             
 3   scheduled_date    110510 non-null  datetime64[ns, UTC]
 4   appointment_date  110510 non-null  datetime64[ns, UTC]
 5   age               110510 non-null  int64              
 6   neighbourhood     110510 non-null  object             
 7   scholarship       110510 non-null  float64            
 8   hipertension      110510 non-null  float64            
 9   diabetes          110510 non-null  float64            
 10  alcoholism        110510 non-null  float64            
 11  handcap           110510 non-null  float64            
 12  sms_received      110510 non-null  float64  

In [17]:
df_copy.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872499824296,5642903,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,False
1,558997776694438,5642503,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0.0,0.0,0.0,0.0,0.0,0.0,False
2,4262962299951,5642549,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,MATA DA PRAIA,0.0,0.0,0.0,0.0,0.0,0.0,False
3,867951213174,5642828,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,False
4,8841186448183,5642494,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,False
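
Depending on the ```pandas``` version, calling a method with ```inplace=True``` on a column selected from a DataFrame may emit a warning or leave the DataFrame unchanged. A hedged alternative, sketched on a made-up Series, is to build the new column and assign it back; here we use ```map``` with a dict, which is a swapped-in technique and not the method used in the cell above:

```python
import pandas as pd

no_show = pd.Series(["No", "Yes", "No"], name="no_show")

# map() looks every value up in the dict and returns a new Series
as_bool = no_show.map({"Yes": True, "No": False})

print(as_bool.tolist())  # [False, True, False]
```

On the real data, the result would be assigned back with something like `df_copy["no_show"] = ...`. Values missing from the dict become ```NaN``` with ```map```, so this form only fits when every possible value is listed.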


You can also use regular expressions as arguments for this method, which could be helpful when you need to extract patterns from an object. You can find a really good tutorial <a href="https://regexone.com/">here</a> to learn more about regex and how to write your own expressions.
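
As a small sketch of the regex option, the example below collapses repeated spaces in some made-up neighbourhood names (the extra spaces are invented for the example; the dataset may not contain any):

```python
import pandas as pd

names = pd.Series(["JARDIM  DA PENHA", "MATA DA   PRAIA"])

# regex=True treats to_replace as a pattern:
# every run of whitespace is replaced with a single space
cleaned = names.replace(to_replace=r"\s+", value=" ", regex=True)

print(cleaned.tolist())  # ['JARDIM DA PENHA', 'MATA DA PRAIA']
```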

<a id="apply"></a>
## apply method

To apply a function along an axis, <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html">```apply```</a> is a great method to use. It takes the function as its first argument and returns the result of applying it. When called on a DataFrame, the ```axis``` argument decides what the function receives: each column (```axis = 0```, the default) or each row (```axis = 1```). When called on a single column (a Series), as in the examples below, the function receives one value at a time. The function can come from a package, be user-defined, or be a ```lambda``` function.

In [18]:
import numpy as np

In [19]:
# function from an imported package
df_copy.age.apply(np.sqrt)

0         7.874008
1         7.483315
2         7.874008
3         2.828427
4         7.483315
            ...   
110532    7.483315
110533    7.141428
110534    4.582576
110535    6.164414
110536    7.348469
Name: age, Length: 110510, dtype: float64

In [20]:
# user-defined function: receives a single age value
def age_bracket(age):
    # "Minor" for ages up to 17, "Adult" otherwise
    if age <= 17:
        return "Minor"
    return "Adult"

In [21]:
df_copy.age.apply(age_bracket)

0         Adult
1         Adult
2         Adult
3         Minor
4         Adult
          ...  
110532    Adult
110533    Adult
110534    Adult
110535    Adult
110536    Adult
Name: age, Length: 110510, dtype: object
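
The cells above call ```apply``` on a single column, where ```axis``` plays no role. To illustrate ```axis``` on a whole DataFrame, here is a minimal sketch with made-up 0/1 flags (the `flags` DataFrame is hypothetical):

```python
import pandas as pd

flags = pd.DataFrame({"hipertension": [1, 0, 1], "diabetes": [0, 0, 1]})

# axis=0 (default): the function receives each column,
# so we get one result per column
per_column = flags.apply(lambda col: col.sum())

# axis=1: the function receives each row,
# so we get one result per row
per_row = flags.apply(lambda row: row.sum(), axis=1)

print(per_column.to_dict())  # {'hipertension': 2, 'diabetes': 1}
print(per_row.tolist())      # [1, 0, 2]
```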

We could use this method to replace the fully uppercase values in the ```neighbourhood``` column with values where only the first letter of each word is capitalized, passing the <a href="https://docs.python.org/3/library/stdtypes.html#str.title">```title```</a> string method to ```apply```.

In [23]:
neighbourhood_corrected = df_copy.neighbourhood.apply(lambda x: x.title())

neighbourhood_replace = neighbourhood_corrected.unique()
print(neighbourhood_replace)

['Jardim Da Penha' 'Mata Da Praia' 'Pontal De Camburi' 'República'
 'Goiabeiras' 'Andorinhas' 'Conquista' 'Nova Palestina' 'Da Penha'
 'Tabuazeiro' 'Bento Ferreira' 'São Pedro' 'Santa Martha' 'São Cristóvão'
 'Maruípe' 'Grande Vitória' 'São Benedito' 'Ilha Das Caieiras'
 'Santo André' 'Solon Borges' 'Bonfim' 'Jardim Camburi' 'Maria Ortiz'
 'Jabour' 'Antônio Honório' 'Resistência' 'Ilha De Santa Maria'
 'Jucutuquara' 'Monte Belo' 'Mário Cypreste' 'Santo Antônio' 'Bela Vista'
 'Praia Do Suá' 'Santa Helena' 'Itararé' 'Inhanguetá' 'Universitário'
 'São José' 'Redenção' 'Santa Clara' 'Centro' 'Parque Moscoso'
 'Do Moscoso' 'Santos Dumont' 'Caratoíra' 'Ariovaldo Favalessa'
 'Ilha Do Frade' 'Gurigica' 'Joana D´Arc' 'Consolação' 'Praia Do Canto'
 'Boa Vista' 'Morada De Camburi' 'Santa Luíza' 'Santa Lúcia'
 'Barro Vermelho' 'Estrelinha' 'Forte São João' 'Fonte Grande'
 'Enseada Do Suá' 'Santos Reis' 'Piedade' 'Jesus De Nazareth'
 'Santa Tereza' 'Cruzamento' 'Ilha Do Príncipe' 'Romão' 'Comdusa'


In [24]:
neighbourhood_toreplace = df_copy.neighbourhood.unique()
print(neighbourhood_toreplace)

['JARDIM DA PENHA' 'MATA DA PRAIA' 'PONTAL DE CAMBURI' 'REPÚBLICA'
 'GOIABEIRAS' 'ANDORINHAS' 'CONQUISTA' 'NOVA PALESTINA' 'DA PENHA'
 'TABUAZEIRO' 'BENTO FERREIRA' 'SÃO PEDRO' 'SANTA MARTHA' 'SÃO CRISTÓVÃO'
 'MARUÍPE' 'GRANDE VITÓRIA' 'SÃO BENEDITO' 'ILHA DAS CAIEIRAS'
 'SANTO ANDRÉ' 'SOLON BORGES' 'BONFIM' 'JARDIM CAMBURI' 'MARIA ORTIZ'
 'JABOUR' 'ANTÔNIO HONÓRIO' 'RESISTÊNCIA' 'ILHA DE SANTA MARIA'
 'JUCUTUQUARA' 'MONTE BELO' 'MÁRIO CYPRESTE' 'SANTO ANTÔNIO' 'BELA VISTA'
 'PRAIA DO SUÁ' 'SANTA HELENA' 'ITARARÉ' 'INHANGUETÁ' 'UNIVERSITÁRIO'
 'SÃO JOSÉ' 'REDENÇÃO' 'SANTA CLARA' 'CENTRO' 'PARQUE MOSCOSO'
 'DO MOSCOSO' 'SANTOS DUMONT' 'CARATOÍRA' 'ARIOVALDO FAVALESSA'
 'ILHA DO FRADE' 'GURIGICA' 'JOANA D´ARC' 'CONSOLAÇÃO' 'PRAIA DO CANTO'
 'BOA VISTA' 'MORADA DE CAMBURI' 'SANTA LUÍZA' 'SANTA LÚCIA'
 'BARRO VERMELHO' 'ESTRELINHA' 'FORTE SÃO JOÃO' 'FONTE GRANDE'
 'ENSEADA DO SUÁ' 'SANTOS REIS' 'PIEDADE' 'JESUS DE NAZARETH'
 'SANTA TEREZA' 'CRUZAMENTO' 'ILHA DO PRÍNCIPE' 'ROMÃO' 'COMDUSA'


In [25]:
df_copy.neighbourhood.replace(to_replace=neighbourhood_toreplace, value=neighbourhood_replace, inplace=True)
df_copy.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872499824296,5642903,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,Jardim Da Penha,0.0,1.0,0.0,0.0,0.0,0.0,False
1,558997776694438,5642503,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,Jardim Da Penha,0.0,0.0,0.0,0.0,0.0,0.0,False
2,4262962299951,5642549,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,Mata Da Praia,0.0,0.0,0.0,0.0,0.0,0.0,False
3,867951213174,5642828,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,Pontal De Camburi,0.0,0.0,0.0,0.0,0.0,0.0,False
4,8841186448183,5642494,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,Jardim Da Penha,0.0,1.0,1.0,0.0,0.0,0.0,False
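
The same result can usually be reached in one step with the ```.str``` accessor, which applies string methods to a whole column. This is a shorter alternative, not the approach used in the cells above, sketched here on a made-up Series:

```python
import pandas as pd

hoods = pd.Series(["JARDIM DA PENHA", "REPÚBLICA", "JARDIM DA PENHA"])

# .str.title() title-cases every value without building a replacement list
titled = hoods.str.title()

print(titled.tolist())  # ['Jardim Da Penha', 'República', 'Jardim Da Penha']
```

On the real data, the result would be assigned back to the column, for example `df_copy["neighbourhood"] = df_copy["neighbourhood"].str.title()`.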


<a id="insert"></a>
## insert method

The <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html">```insert```</a> method is as straightforward as it gets, allowing us to insert a new column in our DataFrame. It takes 3 main arguments: ```loc```, the position for our new column; ```column```, the label for our column; and ```value```, the Series or list with the values our column will have.<br>
Let's take one of the prior exercises as an example for ```insert```. Let's say we wanted a column next to the age column, which will tell us whether a patient is a minor or not:

In [26]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110510 entries, 0 to 110536
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype              
---  ------            --------------   -----              
 0   patient_id        110510 non-null  int64              
 1   appointment_id    110510 non-null  int64              
 2   gender            110510 non-null  object             
 3   scheduled_date    110510 non-null  datetime64[ns, UTC]
 4   appointment_date  110510 non-null  datetime64[ns, UTC]
 5   age               110510 non-null  int64              
 6   neighbourhood     110510 non-null  object             
 7   scholarship       110510 non-null  float64            
 8   hipertension      110510 non-null  float64            
 9   diabetes          110510 non-null  float64            
 10  alcoholism        110510 non-null  float64            
 11  handcap           110510 non-null  float64            
 12  sms_received      110510 non-null  float64  

In [27]:
# Use a different name so the age_bracket function is not overwritten
age_bracket_values = df_copy.age.apply(age_bracket)

df_copy.insert(6, "age_bracket", age_bracket_values)

In [28]:
df_copy.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,age_bracket,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872499824296,5642903,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,Adult,Jardim Da Penha,0.0,1.0,0.0,0.0,0.0,0.0,False
1,558997776694438,5642503,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,Adult,Jardim Da Penha,0.0,0.0,0.0,0.0,0.0,0.0,False
2,4262962299951,5642549,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,Adult,Mata Da Praia,0.0,0.0,0.0,0.0,0.0,0.0,False
3,867951213174,5642828,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,Minor,Pontal De Camburi,0.0,0.0,0.0,0.0,0.0,0.0,False
4,8841186448183,5642494,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,Adult,Jardim Da Penha,0.0,1.0,1.0,0.0,0.0,0.0,False
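
Instead of counting columns by hand to find the position, it can be computed from the column label. A minimal sketch on a made-up DataFrame (`people` and the `is_minor` column are hypothetical), using ```columns.get_loc```:

```python
import pandas as pd

people = pd.DataFrame({"age": [62, 8], "neighbourhood": ["Centro", "Bonfim"]})

# Boolean column computed from the existing values
is_minor = people["age"] <= 17

# Find where "age" sits and place the new column right after it
position = people.columns.get_loc("age") + 1

# insert() changes the DataFrame directly and returns None
people.insert(position, "is_minor", is_minor)

print(people.columns.tolist())  # ['age', 'is_minor', 'neighbourhood']
```

Note that ```insert``` raises an error if a column with the same label already exists, so re-running the same cell twice will fail.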


We could also create a new column based on columns already in the DataFrame: besides having 4 separate columns for comorbidities, we could have one column with a disease total, as a general approximation of the patient's health state.

In [30]:
disease_total = df_copy.hipertension + df_copy.diabetes + df_copy.alcoholism + df_copy.handcap
df_copy.insert(9, "disease_total", disease_total)

In [31]:
df_copy.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,age_bracket,neighbourhood,scholarship,disease_total,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872499824296,5642903,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,Adult,Jardim Da Penha,0.0,1.0,1.0,0.0,0.0,0.0,0.0,False
1,558997776694438,5642503,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,Adult,Jardim Da Penha,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
2,4262962299951,5642549,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,Adult,Mata Da Praia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
3,867951213174,5642828,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,Minor,Pontal De Camburi,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False
4,8841186448183,5642494,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,Adult,Jardim Da Penha,0.0,2.0,1.0,1.0,0.0,0.0,0.0,False
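
When many columns are involved, selecting them with a list and calling ```sum(axis=1)``` avoids writing out every ```+```. The sketch below uses made-up values (the `conditions` DataFrame is hypothetical) and also shows a variant that counts how many conditions are present, which differs from the sum if any column holds values above 1:

```python
import pandas as pd

conditions = pd.DataFrame({
    "hipertension": [1.0, 0.0],
    "diabetes": [1.0, 0.0],
    "alcoholism": [0.0, 0.0],
    "handcap": [0.0, 2.0],
})
cols = ["hipertension", "diabetes", "alcoholism", "handcap"]

# Row-wise sum of the selected columns
total = conditions[cols].sum(axis=1)

# Row-wise count of columns with a value above 0
count_present = (conditions[cols] > 0).sum(axis=1)

print(total.tolist())          # [2.0, 2.0]
print(count_present.tolist())  # [2, 1]
```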


<a id="more"></a>
## More pandas
We've reached the end of these ```pandas``` tutorials! We hope you found them useful for your data exploration and manipulation activities. There are a lot more methods and functions to discover and use with ```pandas```, so feel free to check the documentation and find new methods to help you with the specific issues your data might have, and have fun!<br>
Here are some additional resources if you want to learn more about what you can do with the package:<br>
- <a href="https://www.educative.io/blog/pandas-cheat-sheet">Pandas cheat sheet</a>
- <a href="https://gallery.azure.ai/Experiment/Methods-for-handling-missing-values-1">Methods for handling missing values</a>
- <a href="https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6">A Beginner’s Guide to Optimizing Pandas Code for Speed</a>
- <a href="https://matplotlib.org/">Matplotlib library for visualizations</a>
- <a href="https://seaborn.pydata.org/tutorial.html">Seaborn library for visualizations</a>