# Data Cleaning

Note: these links lead to the GitHub wiki pages for this section.

**[Transforming Data Types](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#transforming-data-types)**
* [Convert non-numeric to numeric - `to_numeric()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#convert-non-numeric-to-numeric)
* [Convert anything to anything - `astype()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#convert-anything-to-anything)

**[Missing Data](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#missing-data)**
* [Identifying Missing Data - `isnull()`, and `read_csv()` parameter `na_values`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#identifying-missing-data)
* [Replace Missing Data with values - `fillna()`, `ffill` and `bfill`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#replacing-null--na-with-values)
* [Removing Missing Data - `dropna()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#removing-missing-data-null--na)

**[Deleting / Dropping - `drop()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#deleting--dropping-columns-and-rows)**
* [Dropping columns - `del`, `drop()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#dropping-columns)
* [Dropping rows - `drop()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#dropping-rows)

**[Column Manipulations](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#column-manipulations)**
* [Rename columns - `rename()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#rename-columns)
* [Dealing with the index - `reset_index()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#dealing-with-the-index)

**[Working with Strings](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#working-with-strings)**
* [Remove trailing whitespace at the beginning, end, or both sides of your text - `lstrip()`, `rstrip()`, or `strip()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#remove-trailing-whitespace-at-the-beginning-end-or-both-sides-of-your-text-with-lstrip-rstrip-or-strip)
* [Change all letters to lowercase, `lower()`, or uppercase, `upper()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#change-all-letters-to-lowercase-lower-or-uppercase-upper)
* [Check whether elements contains a pattern using `contains()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#check-whether-elements-contains-a-pattern-using-contains)
* [Check whether elements matches a pattern exactly using `match()`](https://github.com/kn-kn/python-guide/wiki/Data-Cleaning#check-whether-elements-matches-a-pattern-exactly-using-match)


Unless otherwise specified, the examples use the following DataFrames:

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/kenguyen/Downloads/DemographicData.csv')
df.head(3)

Unnamed: 0,Country Name,Country Code,Birth rate,Internet users,Income Group
0,Aruba,ABW,10.244,78.9,High income
1,Afghanistan,AFG,35.253,5.9,Low income
2,Angola,AGO,45.985,19.1,Upper middle income


In [2]:
df_missing = pd.read_excel('C:/Users/kenguyen/Downloads/sample_data.xlsx')
df_missing.head(3)

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3.0,1.0,1000
1,100002000.0,197.0,LEXINGTON,N,3.0,1.5,--
2,100003000.0,,LEXINGTON,N,,1.0,850


In [3]:
raw_data = {'Numbers': ["8", 6.0, "7", "3", 0.9],'Names': ["Canada", "USA", "Brazil", "Germany", "France"]}
df2 = pd.DataFrame(raw_data, columns = ['Numbers', 'Names'])

## Transforming Data Types

A great Stack Overflow answer for this topic can be found here: [Link](https://stackoverflow.com/questions/15891038/change-data-type-of-columns-in-pandas)


To view the data types of all the columns in your dataframe, use `.dtypes`.

In [4]:
df.dtypes

Country Name       object
Country Code       object
Birth rate        float64
Internet users    float64
Income Group       object
dtype: object

### Convert non-numeric to numeric

[`to_numeric()` Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html)

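`to_numeric()` converts non-numeric types (e.g. strings) to a suitable numeric type. By default it raises an error when a value cannot be parsed; pass `errors='coerce'` to turn such values into `NaN` instead.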

**Column "Numbers" has mixed strings and numeric values**

In [5]:
df2

Unnamed: 0,Numbers,Names
0,8,Canada
1,6.0,USA
2,7,Brazil
3,3,Germany
4,0.9,France


**Numbers is now a column with all float values**

In [6]:
df2['Numbers'] = pd.to_numeric(df2['Numbers'])
df2

Unnamed: 0,Numbers,Names
0,8.0,Canada
1,6.0,USA
2,7.0,Brazil
3,3.0,Germany
4,0.9,France


**You can also convert multiple columns at once using `apply()`.**

`df[['a', 'b']] = df[['a', 'b']].apply(pd.to_numeric)`
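
As a minimal sketch of how `to_numeric()` behaves when a value cannot be parsed, the example below uses a hypothetical Series (not part of the sample data) that contains the string `"oops"`. The `errors="coerce"` option replaces unparseable entries with `NaN` rather than raising.

```python
import pandas as pd

# Hypothetical column mixing numeric strings, floats, and one unparseable value
s = pd.Series(["8", 6.0, "7", "oops", 0.9])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising a ValueError (the default is errors="raise")
converted = pd.to_numeric(s, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

Coercing is convenient for a first pass, but it hides bad values, so it is worth checking how many `NaN` entries appeared afterwards.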

---

### Convert Anything to Anything

[`astype()` Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)

You can convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so).

**Convert column "a" to int64 type and "b" to complex type.**
* `df = df.astype({"a": int, "b": complex})`

**Convert column "a" to string.**
* `df['a'] = df['a'].astype(str)`
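
The two conversions above can be sketched on a small frame. The frame `df_types` and its values are assumptions made up for illustration; the column names "a" and "b" simply mirror the snippets.

```python
import pandas as pd

# Hypothetical frame; column names "a" and "b" mirror the snippets above
df_types = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10, 20, 30]})

# Pass a dict to convert several columns in one call
df_types = df_types.astype({"a": "int64", "b": "float64"})

# Convert a single column to strings
df_types["a"] = df_types["a"].astype(str)

print(df_types.dtypes)
print(df_types["a"].tolist())  # ['1', '2', '3']
```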

## Missing Data

### Identifying Missing Data

When values in your data are completely missing, pandas treats them as `NaN`. Example data is shown below:

In [7]:
df_missing

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3,1,1000
1,100002000.0,197.0,LEXINGTON,N,3,1.5,--
2,100003000.0,,LEXINGTON,N,,1,850
3,100004000.0,201.0,BERKELEY,12,1,,700
4,,203.0,BERKELEY,Y,3,2,1600
5,100006000.0,207.0,BERKELEY,Y,,1,800
6,100007000.0,,WASHINGTON,,2,HURLEY,950
7,100008000.0,213.0,TREMONT,Y,1,1,
8,100009000.0,215.0,TREMONT,Y,na,2,1800


The `isnull()` function shows which data points pandas identifies as missing. It works well when chained with other functions.

In [8]:
df_missing.isnull()

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,True,False,False,True,False,False
3,False,False,False,False,False,True,False
4,True,False,False,False,False,False,False
5,False,False,False,False,True,False,False
6,False,True,False,True,False,False,False
7,False,False,False,False,False,False,True
8,False,False,False,False,False,False,False
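
As a sketch of what chaining `isnull()` can look like, the example below counts missing values per column and selects the rows that have any gap. The frame `df_demo` is a hypothetical stand-in with a few columns named after the sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing cells
df_demo = pd.DataFrame({
    "ST_NUM": [104.0, np.nan, 201.0],
    "ST_NAME": ["PUTNAM", "LEXINGTON", None],
    "NUM_BATH": [1.0, np.nan, np.nan],
})

# Count missing values per column
missing_per_column = df_demo.isnull().sum()

# Keep only the rows that have at least one missing value
rows_with_missing = df_demo[df_demo.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```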


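However, notice that pandas doesn't recognize every instance of missing data. It did not recognize:

* "na" in column `NUM_BEDROOMS`
* "--" in column `SQ_FT`

You can create a list of values that should be interpreted as missing and pass it through the `na_values` parameter of `pd.read_csv()` or `pd.read_excel()`. Each entry must match the cell's text exactly: the list below contains "---", so the "--" in `SQ_FT` is still not treated as missing; add "--" to the list to catch it.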

In [9]:
missing_values_list = ["na", "n/a", "---"]
df_missing = pd.read_excel("C:/Users/kenguyen/Downloads/sample_data.xlsx", na_values = missing_values_list)
df_missing

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3.0,1,1000
1,100002000.0,197.0,LEXINGTON,N,3.0,1.5,--
2,100003000.0,,LEXINGTON,N,,1,850
3,100004000.0,201.0,BERKELEY,12,1.0,,700
4,,203.0,BERKELEY,Y,3.0,2,1600
5,100006000.0,207.0,BERKELEY,Y,,1,800
6,100007000.0,,WASHINGTON,,2.0,HURLEY,950
7,100008000.0,213.0,TREMONT,Y,1.0,1,
8,100009000.0,215.0,TREMONT,Y,,2,1800
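
A self-contained sketch of `na_values`, assuming a small hypothetical CSV held in a string rather than the spreadsheet used above. Here the two-dash marker `"--"` is listed explicitly so that it is read as missing.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """PID,NUM_BEDROOMS,SQ_FT
1,3,1000
2,na,--
3,n/a,850
"""

# Every string listed here is read as NaN; "--" is included explicitly
missing_values_list = ["na", "n/a", "--"]
df_sketch = pd.read_csv(io.StringIO(csv_text), na_values=missing_values_list)

print(df_sketch.isnull().sum())
```

Columns that contain a marker end up as floats once the marker becomes `NaN`, which is usually what you want before doing arithmetic on them.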


### Replacing NULL / NA with values

**Replace missing values of a specific column with a single value.**

`df['Column_Name'] = df['Column_Name'].fillna(125)`

**Replace missing values of a specific column with the average of the same column**

`mean = df['Revenue'].mean()`\
`df['Revenue'] = df['Revenue'].fillna(mean)`

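**Replace missing values with the previous or next valid value in the dataframe (usually the row above or below).**

`df.ffill()` (forward fill: copies the previous valid value down)\
`df.bfill()` (back fill: copies the next valid value up)

The older spelling `df.fillna(method='ffill')` / `df.fillna(method='bfill')` is deprecated in recent pandas versions.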
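
The three filling strategies can be compared side by side in a short sketch. The frame `df_fill` and its values are hypothetical; only the column name "Revenue" comes from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "Revenue" mirrors the column name used above
df_fill = pd.DataFrame({"Revenue": [100.0, np.nan, 300.0, np.nan]})

# Fill with the column mean (fillna returns a new Series)
mean_filled = df_fill["Revenue"].fillna(df_fill["Revenue"].mean())

# Forward fill copies the previous valid value downward
forward_filled = df_fill["Revenue"].ffill()

# Back fill copies the next valid value upward; a trailing NaN stays NaN
back_filled = df_fill["Revenue"].bfill()

print(mean_filled.tolist())     # [100.0, 200.0, 300.0, 200.0]
print(forward_filled.tolist())  # [100.0, 100.0, 300.0, 300.0]
print(back_filled.tolist())
```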

---

### Removing Missing Data (NULL / NA)

**Drop rows that contain at least one missing value in any column**

`df = df.dropna()`

**Drop rows only if every cell in the row is missing**

`df = df.dropna(how='all')`

**Drop columns that contain only missing values**

`df = df.dropna(axis=1, how='all')`
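
A minimal sketch of the `dropna()` variants on a hypothetical frame. The `subset` parameter, which restricts the check to chosen columns, is not covered above and is included here as an extra option.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is entirely empty and column "c" is entirely empty
df_drop = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [4.0, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan],
})

no_empty_columns = df_drop.dropna(axis=1, how="all")  # drops column "c"
no_empty_rows = df_drop.dropna(how="all")             # drops row 2
only_a_filled = df_drop.dropna(subset=["a"])          # checks column "a" only

print(no_empty_columns.columns.tolist())  # ['a', 'b']
print(no_empty_rows.index.tolist())       # [0, 1]
print(only_a_filled.index.tolist())       # [0, 1]
```

Note that a plain `df_drop.dropna()` would remove every row here, because the all-empty column "c" makes each row incomplete; dropping empty columns first avoids that.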

---

## Deleting / Dropping Columns and Rows

### Dropping columns

**Use `del` to drop a single column**

In [10]:
del df['Country Code']
df.head(5)

Unnamed: 0,Country Name,Birth rate,Internet users,Income Group
0,Aruba,10.244,78.9,High income
1,Afghanistan,35.253,5.9,Low income
2,Angola,45.985,19.1,Upper middle income
3,Albania,12.877,57.2,Upper middle income
4,United Arab Emirates,11.044,88.0,High income


**Use `drop()` to drop one or more columns at once**

In [11]:
columns_list=['Birth rate', 'Internet users']
df.drop(columns=columns_list, inplace=True)
df.head(5)

Unnamed: 0,Country Name,Income Group
0,Aruba,High income
1,Afghanistan,Low income
2,Angola,Upper middle income
3,Albania,Upper middle income
4,United Arab Emirates,High income


### Dropping rows

**Specify exact row numbers to drop**

In [12]:
# Reset the data for example purposes
df = pd.read_csv('C:/Users/kenguyen/Downloads/DemographicData.csv')

# Notice the index numbers 0 and 2 are now dropped
df.drop([0, 2], inplace=True)
df.head(3)

Unnamed: 0,Country Name,Country Code,Birth rate,Internet users,Income Group
1,Afghanistan,AFG,35.253,5.9,Low income
3,Albania,ALB,12.877,57.2,Upper middle income
4,United Arab Emirates,ARE,11.044,88.0,High income


**Drop rows containing a specific value or specified condition**

**1. Assigning the result to a variable**

`df = df.drop(df[df['Internet users'] > 50].index)`

**2. Or dropping in place**

`df.drop(df[df['Internet users'] > 50].index, inplace=True)`

In [13]:
df.drop(df[df['Internet users'] > 50].index, inplace=True)
df.head(5)

Unnamed: 0,Country Name,Country Code,Birth rate,Internet users,Income Group
1,Afghanistan,AFG,35.253,5.9,Low income
6,Armenia,ARM,13.308,41.9,Lower middle income
11,Burundi,BDI,44.151,1.3,Low income
13,Benin,BEN,36.44,4.9,Low income
14,Burkina Faso,BFA,40.551,9.1,Low income


**Drop all rows that contain certain strings**

In [14]:
country_list = ['Aruba', 'Afghanistan']
df = df[~df['Country Name'].isin(country_list)]
df.head(5)

Unnamed: 0,Country Name,Country Code,Birth rate,Internet users,Income Group
6,Armenia,ARM,13.308,41.9,Lower middle income
11,Burundi,BDI,44.151,1.3,Low income
13,Benin,BEN,36.44,4.9,Low income
14,Burkina Faso,BFA,40.551,9.1,Low income
15,Bangladesh,BGD,20.142,6.63,Lower middle income
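
Dropping by condition can also be written as a boolean filter that keeps the rows you want. The sketch below, on a hypothetical four-row subset of the demographic data, suggests that both forms give the same result when the index labels are unique.

```python
import pandas as pd

# Hypothetical subset of the demographic data
df_rows = pd.DataFrame({
    "Country Name": ["Aruba", "Afghanistan", "Angola", "Albania"],
    "Internet users": [78.9, 5.9, 19.1, 57.2],
})

# Dropping by index label, as in the examples above
dropped = df_rows.drop(df_rows[df_rows["Internet users"] > 50].index)

# Equivalent filter: keep the rows where the condition is False
kept = df_rows[~(df_rows["Internet users"] > 50)]

print(dropped.equals(kept))            # True
print(kept["Country Name"].tolist())   # ['Afghanistan', 'Angola']
```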


## Column Manipulations

### Rename columns

Use dictionary key-value pairs to rename your columns. You can either add them to a dictionary or just specify them within `rename()`.

Specify within `rename()`:

`df = df.rename(columns={'Country Name': 'ABCD', 'Country Code': '+2~_', 'Birth rate': 'HELLO'})`

OR

Add to dictionary:

`new_names = {'Country Name': 'ABCD', 'Country Code': '+2~_', 'Birth rate': 'HELLO'}`

`df.rename(columns=new_names, inplace=True)`

In [15]:
new_names = {'Country Name': 'ABCD', 'Country Code': '+2~_', 'Birth rate': 'HELLO'}
df.rename(columns=new_names, inplace=True)
df.head(5)

Unnamed: 0,ABCD,+2~_,HELLO,Internet users,Income Group
6,Armenia,ARM,13.308,41.9,Lower middle income
11,Burundi,BDI,44.151,1.3,Low income
13,Benin,BEN,36.44,4.9,Low income
14,Burkina Faso,BFA,40.551,9.1,Low income
15,Bangladesh,BGD,20.142,6.63,Lower middle income
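
One thing to watch for with `rename()`: keys that do not match an existing column are ignored without any warning. The sketch below uses a hypothetical two-column frame and the `errors="raise"` option, which is not covered above, to surface a misspelled name.

```python
import pandas as pd

# Hypothetical frame with two of the demographic columns
df_cols = pd.DataFrame({"Country Name": ["Aruba"], "Birth rate": [10.244]})

# Keys that are not existing column names are silently ignored by default
renamed = df_cols.rename(columns={"Country Name": "country", "No such column": "x"})
print(renamed.columns.tolist())  # ['country', 'Birth rate']

# errors="raise" turns a misspelled column name into a KeyError
try:
    df_cols.rename(columns={"No such column": "x"}, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```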


### Dealing with the index

**Rename values in the index**

In [16]:
# Reset the data for example purposes
df = pd.read_csv('C:/Users/kenguyen/Downloads/DemographicData.csv')

df.rename(index={0: 'wow', 1:'omg', 2: 'unbelievable'}, inplace=True)
df.head(3)

Unnamed: 0,Country Name,Country Code,Birth rate,Internet users,Income Group
wow,Aruba,ABW,10.244,78.9,High income
omg,Afghanistan,AFG,35.253,5.9,Low income
unbelievable,Angola,AGO,45.985,19.1,Upper middle income


**Reset the index to default**

[`reset_index()` Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html)

In [17]:
df.reset_index(drop=True, inplace=True)
df.head(5)

Unnamed: 0,Country Name,Country Code,Birth rate,Internet users,Income Group
0,Aruba,ABW,10.244,78.9,High income
1,Afghanistan,AFG,35.253,5.9,Low income
2,Angola,AGO,45.985,19.1,Upper middle income
3,Albania,ALB,12.877,57.2,Upper middle income
4,United Arab Emirates,ARE,11.044,88.0,High income


**Move the index into a column and rename it simultaneously**

In [18]:
df.reset_index().rename(columns={'index': 'H3LLO_W0RLD!!'}).head(3)

Unnamed: 0,H3LLO_W0RLD!!,Country Name,Country Code,Birth rate,Internet users,Income Group
0,0,Aruba,ABW,10.244,78.9,High income
1,1,Afghanistan,AFG,35.253,5.9,Low income
2,2,Angola,AGO,45.985,19.1,Upper middle income


By default, using `reset_index()` will reset the index and the old index will become a new column in your dataframe. If you do not want the old index as a column, then add the `drop=True` parameter to `reset_index()`.
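
The difference between the two behaviours can be sketched on a hypothetical frame whose index has gaps, as it would after dropping rows:

```python
import pandas as pd

# Hypothetical frame whose index has gaps after rows were dropped
df_idx = pd.DataFrame({"Country Name": ["Afghanistan", "Albania"]}, index=[1, 3])

# Default: the old index is kept as a new column named "index"
with_old_index = df_idx.reset_index()

# drop=True: the old index is discarded
without_old_index = df_idx.reset_index(drop=True)

print(with_old_index.columns.tolist())   # ['index', 'Country Name']
print(with_old_index["index"].tolist())  # [1, 3]
print(without_old_index.index.tolist())  # [0, 1]
```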

## Working with Strings

[Documentation on working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)

These are just a few of the many ways to work with strings. A quick Google search can help with your specific situation.

#### Remove trailing whitespace at the beginning, end, or both sides of your text with `lstrip()`, `rstrip()`, or `strip()`

`df['Your_columns'].str.strip()`

`df['Your_columns'].str.lstrip()`

`df['Your_columns'].str.rstrip()`

#### Change all letters to lowercase, `lower()`, or uppercase, `upper()`

`df['Your_columns'].str.lower()`

`df['Your_columns'].str.upper()`
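
The `.str` methods return a new Series, so they can be chained. A small sketch, assuming a hypothetical column with inconsistent spacing and casing:

```python
import pandas as pd

# Hypothetical column with inconsistent spacing and casing
names = pd.Series(["  Canada", "USA  ", "  brazil  "])

# Chain .str calls to combine steps: strip both sides, then lowercase
cleaned = names.str.strip().str.lower()

print(cleaned.tolist())             # ['canada', 'usa', 'brazil']
print(names.str.lstrip().tolist())  # ['Canada', 'USA  ', 'brazil  ']
print(names.str.rstrip().tolist())  # ['  Canada', 'USA', '  brazil']
```

Because the original Series is left unchanged, assign the result back (for example `df['Your_columns'] = df['Your_columns'].str.strip()`) to keep the cleaned values.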

#### Check whether elements contains a pattern using `contains()`

This checks whether the Country Name contains an uppercase "C" followed by a lowercase letter anywhere in its text:

In [19]:
pattern = r'[C][a-z]'
df[df['Country Name'].str.contains(pattern)]

Unnamed: 0,Country Name,Country Code,Birth rate,Internet users,Income Group
29,Central African Republic,CAF,34.076,3.5,Low income
30,Canada,CAN,10.9,85.8,High income
32,Chile,CHL,13.385,66.5,High income
33,China,CHN,12.1,45.8,Upper middle income
34,Cote d'Ivoire,CIV,37.32,8.4,Lower middle income
35,Cameroon,CMR,37.236,6.4,Lower middle income
36,"Congo, Rep.",COG,37.011,6.6,Lower middle income
37,Colombia,COL,16.076,51.7,Upper middle income
38,Comoros,COM,34.326,6.5,Low income
39,Cabo Verde,CPV,21.625,37.5,Lower middle income


#### Check whether elements matches a pattern exactly using `match()`

Unlike `contains()`, `match()` anchors the pattern at the beginning of the string, so the Country Name **must** start with "C" followed by a lowercase letter (the rest of the name is not checked):

In [20]:
pattern = r'[C][a-z]'
df[df['Country Name'].str.match(pattern)]

Unnamed: 0,Country Name,Country Code,Birth rate,Internet users,Income Group
29,Central African Republic,CAF,34.076,3.5,Low income
30,Canada,CAN,10.9,85.8,High income
32,Chile,CHL,13.385,66.5,High income
33,China,CHN,12.1,45.8,Upper middle income
34,Cote d'Ivoire,CIV,37.32,8.4,Lower middle income
35,Cameroon,CMR,37.236,6.4,Lower middle income
36,"Congo, Rep.",COG,37.011,6.6,Lower middle income
37,Colombia,COL,16.076,51.7,Upper middle income
38,Comoros,COM,34.326,6.5,Low income
39,Cabo Verde,CPV,21.625,37.5,Lower middle income
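
The two outputs above look alike because every listed country happens to start with "C". A sketch on a hypothetical Series makes the difference visible; `fullmatch()`, which requires the whole string to match, is not covered above and is shown for comparison.

```python
import pandas as pd

# Hypothetical names chosen so the three methods disagree
countries = pd.Series(["Canada", "New Caledonia", "Peru"])
pattern = r"[C][a-z]"

# contains() searches anywhere in the string (like re.search)
print(countries.str.contains(pattern).tolist())  # [True, True, False]

# match() only looks at the start of the string (like re.match)
print(countries.str.match(pattern).tolist())     # [True, False, False]

# fullmatch() requires the entire string to match the pattern
print(countries.str.fullmatch(r"[C][a-z]+").tolist())  # [True, False, False]
```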
