# Data Cleaning
https://www.miamioh.edu/cads/students/coding-tutorials/python/data-cleaning/index.html

## Step 1: Understanding the data

Before cleaning data, we want to know a few things about it: the dimensions of the dataset, the name and data type of each variable, and what the first and last few rows look like (to confirm the data matches our expectations).

df.shape  # Check the dimensions of the dataset

df.dtypes  # Look at the data type of each column

### Do these match the expected data types for the columns?

df.head()  # Read the first five rows

df.tail()  # Read the last five rows

df.columns.values  # Returns an array of column names

df.columns.values.tolist()  # Returns a list of column names
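As a minimal sketch of these checks, the block below builds a small toy DataFrame and runs each inspection call on it. The column names and values are invented for illustration; in practice `df` would be loaded from a file.

```python
import pandas as pd

# Hypothetical toy data, not from the original tutorial.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cara", "Dan"],
        "age": [34, 28, 45, 39],
        "height_cm": [165.0, 180.5, 172.3, 177.8],
    }
)

n_rows, n_cols = df.shape            # (number of rows, number of columns)
dtypes = df.dtypes                   # one dtype per column
first_rows = df.head(2)              # first two rows instead of the default five
last_rows = df.tail(2)               # last two rows
column_names = df.columns.tolist()   # plain Python list of column names

print(n_rows, n_cols)
print(dtypes)
print(column_names)
```

Passing a number to `head()` or `tail()` changes how many rows are shown, which is handy when five rows are too many or too few to judge the data.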

## Step 2: Check missing values
Next, we check whether there are any missing values. The pandas method dataframe.isnull() returns True for missing cells and False for non-missing cells. For a large dataset, however, scanning that output for missing values is impractical. Usually we first want to know whether any values are missing at all before we try to find where they are. dataframe.isnull().values.any() returns True when at least one value in the data is missing, and dataframe.isnull().sum().sum() returns the total number of missing values in the dataset.

df.isnull() #Check missing values

df.notnull() # Check non missing values

df.isnull().values.any() # Check whether there are any missing values

df.isnull().sum() # Count the missing values in each column

df.isnull().sum().sum() # Count the total number of missing values in the data

df[df["col_1"].notnull()] # Returns the rows where col_1 is not null

df[df["col_1"].notnull() & df["col_2"].notnull()] # Returns the rows where both col_1 and col_2 are not null
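The following sketch runs these checks on a small toy DataFrame so the results are easy to verify by eye. The data is hypothetical; only the column names `col_1` and `col_2` are taken from the examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame(
    {
        "col_1": [1.0, np.nan, 3.0, 4.0],
        "col_2": ["a", "b", None, "d"],
    }
)

has_missing = bool(df.isnull().values.any())    # is anything missing at all?
missing_per_column = df.isnull().sum()          # count of missing values per column
total_missing = int(df.isnull().sum().sum())    # count across the whole DataFrame

# & combines two Boolean Series element-wise; both conditions go inside
# a single pair of indexing brackets.
both_present = df[df["col_1"].notnull() & df["col_2"].notnull()]

print(has_missing, total_missing)
print(missing_per_column)
print(both_present)
```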

## Step 3: Get Information about Missing Values

We can subset the data based on the missing values and create a new data frame to hold the rows that remain.

no_missing = df.dropna() # Drop all rows that have missing values and assign the result to no_missing

no_missing = no_missing.reset_index(drop = True) # Renumber the index so that it starts at zero again

You can also set a threshold on missing values. The example below drops rows that contain fewer than 50 non-missing values.

Threshold_missing = df.dropna(thresh = 50) # Keep only the rows with at least 50 non-missing values


If we use dataframe.dropna(thresh=25) to drop rows that contain fewer than 25 non-missing values, we don't change the original data. We can assign the output to a new variable, or save the changes to the original data right away by using dataframe.dropna(thresh=25, inplace=True). For our example, it would be

df.dropna(thresh=25, inplace=True)
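To make the meaning of `thresh` concrete, here is a small sketch on hypothetical data with three columns, using `thresh=2` rather than 25 or 50 so the outcome can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the rows have 3, 2, 0, and 2 non-missing values.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [1.0, 2.0, np.nan, 4.0],
        "c": [1.0, 2.0, np.nan, np.nan],
    }
)

# Keep only fully populated rows, then renumber the index from zero.
no_missing = df.dropna().reset_index(drop=True)

# Keep rows that have at least 2 non-missing values.
at_least_two = df.dropna(thresh=2)

print(no_missing)
print(at_least_two)
print(len(df))  # the original DataFrame is unchanged without inplace=True
```

Assigning the result to a new variable, as here, keeps the original data available for comparison; `inplace=True` trades that away for brevity.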

## Step 4: Fill in Missing Values
For quantitative variables, we may replace missing values with the sample mean, mode, median, or other numbers. For categorical variables, we can create a new category for missing values by replacing missing values with a string.

Replace missing values with 0.

fill_no = df.fillna(0)        #Fill in missing with 0 and save the data to fill_no

df['DataFrame Column'] = df['DataFrame Column'].fillna(0)  # Fill in missing values for a single column

fill_no.head()

Replace all missing values with the string "missing".

fill_str = df.fillna("missing") # Fill in missing values with the string "missing" and save the data to fill_str

df["col_1"] = df["col_1"].fillna(df["col_1"].mean()) # Fill missing values in col_1 with the sample mean (assigning back avoids the chained inplace call, which recent pandas versions warn about)
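The sketch below applies both strategies to one toy DataFrame: the sample mean for a quantitative column and a new "missing" category for a categorical column. The column names `score` and `group` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one quantitative and one categorical column.
df = pd.DataFrame(
    {
        "score": [10.0, np.nan, 30.0],
        "group": ["x", None, "z"],
    }
)

filled = df.copy()  # work on a copy so the original stays available

# Quantitative column: replace missing values with the sample mean.
# mean() skips missing values by default, so this is (10 + 30) / 2.
filled["score"] = filled["score"].fillna(filled["score"].mean())

# Categorical column: treat missing values as their own category.
filled["group"] = filled["group"].fillna("missing")

print(filled)
```

Filling column by column like this avoids the side effect of `df.fillna("missing")` on the whole DataFrame, which would also put strings into numeric columns.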

## Step 5: Dropping Data
We may want to drop duplicate rows if any and save the changes to the original data.


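df.duplicated().any()  # Check whether there are any duplicate rows

df.duplicated(keep = 'first') # keep can be 'first', 'last', or False; returns True at each index that holds a duplicated row

df.drop_duplicates(inplace=True) # Drop duplicate rows and save the change to the original data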
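The three `keep` options differ only in which copies of a repeated row get flagged. A minimal sketch on hypothetical data, where the row `(2, "b")` appears three times:

```python
import pandas as pd

# Hypothetical toy data: rows 1, 2, and 4 are identical.
df = pd.DataFrame({"id": [1, 2, 2, 3, 2], "value": ["a", "b", "b", "c", "b"]})

# keep="first": every copy after the first occurrence is flagged.
dup_first = df.duplicated(keep="first").tolist()

# keep="last": every copy before the last occurrence is flagged.
dup_last = df.duplicated(keep="last").tolist()

# keep=False: all copies of a repeated row are flagged.
dup_all = df.duplicated(keep=False).tolist()

# drop_duplicates uses keep="first" by default.
deduped = df.drop_duplicates().reset_index(drop=True)

print(dup_first)
print(dup_last)
print(dup_all)
print(deduped)
```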

### We also want to drop some observations or some columns.

df = df.drop(df.index[[1,2,3]]) # Drop the second, third, and fourth rows (positions are counted from 0)

df = df.drop(df.index[range(1,11)]) # Drop the 2nd through 11th rows (positions 1 to 10)

df = df.drop(['col_1'], axis = 1) # Drop col_1 from the data

df = df.drop(['col_1', 'col_3'], axis = 1) # Drop col_1 and col_3 from the data set
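As a quick illustration of dropping by position and by column name, the sketch below uses a hypothetical five-row DataFrame. `drop(columns=...)` is shown as an equivalent spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical toy data with five rows and three columns.
df = pd.DataFrame(
    {
        "col_1": range(5),
        "col_2": list("abcde"),
        "col_3": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)

# df.index[[0, 1, 2]] looks up the labels at positions 0, 1, and 2,
# so this drops the first three rows whatever the index labels are.
without_rows = df.drop(df.index[[0, 1, 2]])

# Column names go in a flat list; columns=... is equivalent to axis=1.
without_cols = df.drop(columns=["col_1", "col_3"])

print(without_rows)
print(without_cols)
```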

## Step 6: Subsetting
iloc stands for integer location. It subsets data by integer position. Its counterpart, loc, uses index and column labels (often strings) to find data within your data set.

df.iloc[0] # Show the first row of information

df.iloc[[0, 1, 2]] # Show the first, second, and third row

df.iloc[:,0] #print the first column

df.iloc[:,0:5] #print the first five columns

df.iloc[0:5, 0:3] # a subset of the first 3 columns and 5 rows

df.loc[0] # show the row whose index label is 0 (the first row when the default index is used)

df[['col_1', 'col_3', 'col_4']] #Subset of col_1, col_3 and col_4

df.sample(n=100) #Random sample of 100 rows

df.sample(frac=0.1, replace=True) # A random sample of 10% of the data with replacement
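The difference between iloc and loc is easiest to see when the index is not the default 0, 1, 2. The sketch below uses hypothetical row labels `r1` to `r3`; `random_state` is added to `sample` only so the draw is repeatable.

```python
import pandas as pd

# Hypothetical toy data with string row labels instead of the default index.
df = pd.DataFrame(
    {"col_1": [10, 20, 30], "col_2": ["x", "y", "z"]},
    index=["r1", "r2", "r3"],
)

by_position = df.iloc[0]      # first row, whatever its label is
by_label = df.loc["r1"]       # the row whose index label is "r1"

# Position slices exclude the end: rows 0-1 and column 0 only.
block = df.iloc[0:2, 0:1]

# Label slices include the end label: rows "r1" and "r2".
label_block = df.loc["r1":"r2", "col_1"]

# Two rows drawn at random without replacement.
sample = df.sample(n=2, random_state=0)

print(by_position)
print(block)
print(label_block)
print(sample)
```

With this index, `df.loc[0]` would raise a `KeyError`, because no row carries the label 0.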