<!--NAVIGATION-->
< [Join Merge and Concatenate](02-join-merge-and-concatenate.ipynb) | [Contents](Index.ipynb) | [More Operation on Dataframe](04-more-operation-on-dataframe.ipynb) >

## Data cleaning and handling missing values

Here you will learn the basics of cleaning data and handling missing values.

To try out the functions, let's import a dataframe from the DWH.

In [None]:
import pandas as pd

df = pd.read_csv('../Data/airlineDT.csv', sep=',')
df.head(5)

### Data cleaning

Next, we prepare the data for cleaning.

In [None]:
df.head(5)

There are multiple ways to select a row or a column of a dataframe. Here we select a specific column.

In [None]:
df['DEP_TIME']

You can also access the same variable as an attribute:

In [None]:
df.DEP_TIME

Let's try to select 2D parts as in NumPy

In [None]:
df[3:20, 1:2]

Pandas does not understand NumPy-style indexing written directly like that; the cell above raises an error.

Instead, you can select rows and columns by specifying their indexes. Notice that the **:** sign specifies a range between the given values; on its own it selects everything along that axis.

In [None]:
df.iloc[:,0]
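
As a minimal sketch of the difference, the example below uses a small made-up frame (`demo` is hypothetical, not the airline data). It tries the NumPy-style slice first and then expresses the same positions through `.iloc`. The exact exception type depends on the pandas version, so the sketch catches a generic `Exception`.

```python
import pandas as pd

# Made-up frame for illustration only (not the airline data)
demo = pd.DataFrame({"a": range(5), "b": range(10, 15), "c": range(20, 25)})

try:
    demo[1:3, 0:2]            # NumPy-style 2D slicing on the frame itself
    numpy_style_works = True
except Exception:             # exact exception type varies between pandas versions
    numpy_style_works = False

# The same positions through .iloc: rows 1-2, columns 0-1
block = demo.iloc[1:3, 0:2]

print(numpy_style_works)
print(block)
```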

If you want to clean the data, you can work on a column, a row, or the whole dataframe. If you want to specify both row and column, there are multiple ways to do it. One of the most practical techniques is to use the dataframe's **.loc** or **.iloc** indexers:
- loc gets rows (and/or columns) with particular labels
- iloc gets rows (and/or columns) at integer locations
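
One practical consequence of the label/position distinction is how slices end. The sketch below uses a made-up frame whose index labels deliberately differ from the row positions: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.

```python
import pandas as pd

# Made-up frame; the labels 100..104 deliberately differ from positions 0..4
demo = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50]},
    index=[100, 101, 102, 103, 104],
)

by_label = demo.loc[101:103, "value"]   # labels 101 to 103, end label included
by_position = demo.iloc[1:3, 0]         # positions 1 and 2, end position excluded

print(by_label.tolist())      # [20, 30, 40]
print(by_position.tolist())   # [20, 30]
```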

In [None]:
df.loc[3:20,'TIME_HOUR']

In [None]:
df.iloc[3:20, 1:2]

or

In [None]:
# a list of integer positions selects several columns at once (here columns 1 and 2)
df.iloc[3:20,[1,2]]

If you want to replace specific values with different ones, you can use the replace function.

Let's check the unique values of the variable CARRIER

In [None]:
df['CARRIER'].unique()

We want to replace the given codes with their full names

In [None]:
df['TEST_CARRIER'] = df['CARRIER'].copy()
df['TEST_CARRIER'] = df['TEST_CARRIER'].replace({'UA':"UNITED AIRLINES","AA":"AMERICAN AIRLINES"})
df['TEST_CARRIER'].unique()

You can achieve the same result in a different, simpler way: specify which column / indexes you want to affect and with which values.

In [None]:
df['TEST_CARRIER_2'] = df['CARRIER'].copy()
df.loc[ df['TEST_CARRIER_2']=="UA",'TEST_CARRIER_2'] = "UNITED AIRLINES"
df.loc[ df['TEST_CARRIER_2']=="AA",'TEST_CARRIER_2'] = "AMERICAN AIRLINES"
df['TEST_CARRIER_2'].unique()

The natural question is which method to use, since the outcome is the same. If you are more comfortable with one of the methods and find it easy to follow, stick with it. The replace() method has one clear advantage, though: it accepts a Python dictionary as an argument. To make this concrete, let's say we are not sure yet which values to change, or we might change the names in the future, so we can write:

In [None]:
columns_renamed = {'UA':"UNITED AIRLINES",
                   "AA":"AMERICAN AIRLINES"}

In [None]:
df['TEST_CARRIER'] = df['TEST_CARRIER'].replace(columns_renamed)

It is more comfortable to modify a dictionary than to track down every affected value in a series of **.loc** statements. Also, if there is nothing to replace, replace() does not raise an error. In general, **.loc** and **.iloc** are very useful, since you can clean data with them in many ways. But pandas offers a vast number of helpful functions, so if a more specific function exists for the task, prefer it over **.loc** or **.iloc**. The replace() function is a good example of how this makes code easier to track and maintain.
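
The point about missing keys can be checked with a small sketch. The series below is made up, and the `"ZZ"` entry is a hypothetical code that does not occur in the data; replace() simply skips it, and values without a dictionary entry stay as they are.

```python
import pandas as pd

# Made-up carrier codes for illustration
carriers = pd.Series(["UA", "AA", "DL", "UA"])

full_names = {
    "UA": "UNITED AIRLINES",
    "AA": "AMERICAN AIRLINES",
    "ZZ": "NOT IN THE DATA",   # hypothetical key with no match in the series
}

renamed = carriers.replace(full_names)
print(renamed.tolist())
# ['UNITED AIRLINES', 'AMERICAN AIRLINES', 'DL', 'UNITED AIRLINES']
```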

Let's say you want to create a new variable in the dataframe. You can fill it with a constant value or derive it from existing variables.

In [None]:
df['NEW_VARIABLE'] = 0
df['NEW_VARIABLE']

In [None]:
df['TOTAL_TIME'] = df['DEP_TIME'] + df['ARR_TIME']
df['TOTAL_TIME']

### Missing values

Here you will learn how to detect which values are missing and how to fill them.

In [None]:
df.isna()

Every cell now shows whether the value is missing (True) or not (False). Since inspecting the whole dataframe this way can look messy, you can instead select only the rows where the result is True.
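
A minimal sketch of narrowing the view to the affected rows, on a made-up frame (the values are invented for illustration): `any(axis=1)` reduces the True/False table to one flag per row, which can then be used as a filter.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap in each column
demo = pd.DataFrame({
    "DEP_TIME": [517.0, np.nan, 542.0],
    "CARRIER": ["UA", "AA", None],
})

mask = demo.isna()                        # True where a value is missing
rows_with_gaps = demo[mask.any(axis=1)]   # keep rows with at least one missing value

print(rows_with_gaps)
```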

We can be more specific and filter the dataframe down to the rows where a given variable is missing. Suppose we want to check which rows have a missing CARRIER value.

In [None]:
# boolean mask built from one column; a whole-frame mask cannot index rows in .loc
df.loc[df['CARRIER'].isna()]

That variable looks clean. Now take a look at the variable DEP_TIME.

In [None]:
df[df['DEP_TIME'].isna()]

Clearly there are missing values. For a quick look at how many there are, we can count the non-missing values in each column.

In [None]:
df.count()

The maximum count is 7198, the number of rows, but not all variables reach it, which indicates that they contain missing values.
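
The same reasoning can be turned into a direct count. The sketch below uses a made-up frame: subtracting `count()` from the number of rows gives the number of missing values per column, which matches `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Made-up frame: YEAR is complete, DEP_TIME has two gaps
demo = pd.DataFrame({
    "YEAR": [2013, 2013, 2013, 2013],
    "DEP_TIME": [517.0, np.nan, 542.0, np.nan],
})

n_rows = len(demo)
non_missing = demo.count()        # non-missing values per column
missing = n_rows - non_missing    # missing values per column

print(missing.to_dict())             # {'YEAR': 0, 'DEP_TIME': 2}
print(demo.isna().sum().to_dict())   # {'YEAR': 0, 'DEP_TIME': 2}
```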

Handling missing values in pandas is easy. You can fill them for the whole dataset or only for a specific part of it. First we separate the rows with missing values to get a better view of how it works.

In [None]:
df_missing = df[df['DEP_TIME'].isna()].copy()

Let's fill the missing DEP_TIME values.

In [None]:
df_missing['DEP_TIME'] = df_missing['DEP_TIME'].fillna("filled")
df_missing

Moreover, we can fill missing values not only for a specific column or row but also for the whole dataframe. In this case, we fill the missing values with the mean of each variable.

In [None]:
# numeric_only=True keeps non-numeric columns (e.g. CARRIER) out of the mean
df_missing = df_missing.fillna(df.mean(numeric_only=True))
df_missing
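
To see what a mean-based fill does column by column, here is a minimal sketch on a made-up frame (the values are invented). It assumes only numeric columns should be filled, so the means are computed with `numeric_only=True`; the text column keeps its gap.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in numeric and text columns
demo = pd.DataFrame({
    "DEP_TIME": [500.0, np.nan, 700.0],
    "ARR_TIME": [np.nan, 800.0, 1000.0],
    "CARRIER": ["UA", None, "AA"],
})

means = demo.mean(numeric_only=True)   # one mean per numeric column
filled = demo.fillna(means)            # each column is filled with its own mean

print(filled["DEP_TIME"].tolist())   # [500.0, 600.0, 700.0]
print(filled["ARR_TIME"].tolist())   # [900.0, 800.0, 1000.0]
```

Filling text columns needs a separate decision, for example a placeholder string passed to fillna for that column.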
