<a href="https://colab.research.google.com/github/mlakireddy-cds/sample/blob/main/M0_Mini_Project_03_Literacy_Rate_Prediction_ArvinderShinh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification in Computational Data Science
## A Program by IISc and TalentSprint
### Mini-Project (Ungraded)



## Learning Objective

At the end of this experiment, you will be able to:

* perform data preprocessing
* implement ML classification algorithms

## Problem Statement

We will use district-wise demographic, enrollment, and teacher-indicator data to predict whether the literacy rate in each district is high, medium, or low.

### Data Preprocessing

Data preprocessing is an important step in solving any machine learning problem. Most
datasets need to be cleaned and transformed before a machine learning algorithm can be
trained on them.

Data preprocessing involves the following steps:

    1. Data Cleaning → The primary focus of this step is on
        - Handling missing data
        - Handling noisy data
        - Detecting and removing outliers
    
    2. Data Integration → Used when data is gathered from several sources. The sources are
    combined into a single consistent dataset, which is then used for analysis after cleaning.
    
    3. Data Transformation → Converts the raw data into the format required by the model being built. Common options include:
        - Normalization
        - Aggregation
        - Generalization
        
    4. Data Reduction → After transformation and scaling, redundant data is removed and the remaining data is organized efficiently.
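
As a rough illustration of the four steps above, the sketch below runs them on a tiny made-up dataset. The frames `basic` and `schools`, their column names, and the choice of median filling and min-max normalization are all assumptions for illustration; they are not taken from the project data.

```python
import pandas as pd

# Hypothetical toy data standing in for two source files.
basic = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "population": [1200.0, None, 900.0, 1500.0],
})
schools = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "schools": [12, 8, 9, 15],
    "schools_copy": [12, 8, 9, 15],  # redundant duplicate of "schools"
})

# 1. Cleaning: fill the missing population with the column median.
basic["population"] = basic["population"].fillna(basic["population"].median())

# 2. Integration: combine both sources on the shared key.
merged = basic.merge(schools, on="district")

# 3. Transformation: min-max normalize the numeric columns to [0, 1].
numeric = merged.drop(columns="district")
normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# 4. Reduction: drop columns that duplicate another column exactly.
reduced = normalized.loc[:, ~normalized.T.duplicated()]
print(reduced)
```

Each step here is deliberately the simplest possible choice; the exercises that follow apply the same ideas to the district data.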



In [None]:
# @title Download the datasets
from IPython import get_ipython

ipython = get_ipython()
  
notebook="U1_MH1_Data_Munging" #name of the notebook

def setup():
    # Download and extract the dataset archive via IPython's `sx` shell magic
    # (run_line_magic replaces the deprecated ipython.magic call)
    ipython.run_line_magic("sx", "wget https://cdn.iisc.talentsprint.com/aiml/Experiment_related_data/B15_Data_Munging.zip")
    ipython.run_line_magic("sx", "unzip B15_Data_Munging.zip")
    print("Data downloaded successfully")

setup()

In [None]:
!ls

## Exercise 1 - Load and Explore the Data 
1. We have three different files

  * Districtwise_Basicdata.csv
  * Districtwise_Enrollment_details_indicator.csv
  * Districtwise_Teacher_indicator.csv

  These files contain the necessary data to solve the problem. <br>

2. Load the files based on the **team allocation** mentioned below. Observe the header-level details and the data records while loading the data.
  
  Hint : Use read_csv from pandas with [skiprows or header](https://towardsdatascience.com/import-csv-files-as-pandas-dataframe-with-skiprows-skipfooter-usecols-index-col-and-header-fbf67a2f92a) options.

3. Read the columns of the dataset and rename if required.

  Hint : Rename column names (if any) using the following [link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).
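
To see how `skiprows` and `rename` behave, the sketch below reads a small in-memory CSV instead of the real files. The file contents are invented for illustration; only the column names `ac_year`, `statecd`, `distcd`, and `overall_lit` are borrowed from the project.

```python
import io
import pandas as pd

# Hypothetical file contents: one descriptive line above the real header row.
raw = io.StringIO(
    "District report - basic data\n"
    "AC_YEAR,STATECD,DISTCD,OVERALL_LIT\n"
    "2015-16,28,2801,High\n"
    "2015-16,28,2802,Low\n"
)

# skiprows=1 discards the first line so the second line becomes the header.
sample = pd.read_csv(raw, skiprows=1)

# Lower-case every column name, then rename one column explicitly.
sample = sample.rename(columns=str.lower).rename(columns={"ac_year": "year"})
print(sample.columns.tolist())
```

Opening the real CSV files in a text editor first is usually the quickest way to decide how many rows to skip.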

Team allocation for dataset selection

    Team A = 1,3,5,7,9
        Districtwise_Basicdata.csv
        Districtwise_Enrollment_details_indicator.csv

    Team B = 2,4,6,8,10
        Districtwise_Basicdata.csv
        Districtwise_Teacher_indicator.csv

In [None]:
# Import the required packages; add any further imports as needed
import pandas as pd
import numpy as np

In [None]:
# YOUR CODE HERE for loading and exploring the datasets
Districtwise_Basicdata = pd.read_csv('Districtwise_Basicdata.csv', skiprows = 1)
Districtwise_Teacher_indicator = pd.read_csv('Districtwise_Teacher_indicator.csv', skiprows = 3)

In [None]:
Districtwise_Basicdata.rename(columns=str.lower, inplace=True)
Districtwise_Teacher_indicator.rename(columns={"ac_year": "year"}, inplace=True)

print(Districtwise_Basicdata.shape)
print(Districtwise_Teacher_indicator.shape)

(1324, 19)
(1324, 181)


## Exercise 2  - Data Integration

As the required data is present in different datasets, we need to **integrate both to make a single dataframe/dataset**.
  * For integrating the datasets, create a unique identifier for each row in both the dataframes so that it can be used to map the data in different files.
   
    * Combine the year, state code, and district code columns to form a new unique identifier column; refer to this [link](https://stackoverflow.com/questions/33098383/merge-multiple-column-values-into-one-column-in-python-pandas).
    * Set the identifier column as the index for each dataframe.

    * Integrate the dataframes using the above index
     
     Hint: For merging or joining the datasets, refer to this [link](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

**Example:** Data for the district Anantapur in Andhra Pradesh, which is present in different files, should form a single row after integrating the datasets.
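
One possible way to build the identifier and join on it is sketched below on two miniature frames. The helper `add_identifier`, the index name `uid`, the column `tch_total`, and all values are hypothetical and only serve to show the mechanics.

```python
import pandas as pd

# Hypothetical miniature versions of the two district-wise tables.
basic = pd.DataFrame({
    "year": ["2015-16", "2015-16"],
    "statecd": [28, 28],
    "distcd": [2801, 2802],
    "overall_lit": ["High", "Low"],
})
teachers = pd.DataFrame({
    "year": ["2015-16", "2015-16"],
    "statecd": [28, 28],
    "distcd": [2802, 2801],  # deliberately in a different row order
    "tch_total": [410, 530],
})

def add_identifier(frame):
    """Return a copy indexed by a 'year_statecd_distcd' string key."""
    key = (
        frame["year"].astype(str) + "_"
        + frame["statecd"].astype(str) + "_"
        + frame["distcd"].astype(str)
    )
    return frame.drop(columns=["year", "statecd", "distcd"]).set_index(key.rename("uid"))

# An index-on-index join lines up rows by identifier, not by position.
combined = add_identifier(basic).join(add_identifier(teachers), how="inner")
print(combined)
```

Merging directly on the three key columns with `pd.merge(..., on=[...])` gives an equivalent result without building a string key.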


In [None]:
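# Outer-join on the shared keys; validate= raises if a key repeats, indicator= adds a `_merge` column
df = pd.merge(Districtwise_Basicdata, Districtwise_Teacher_indicator, on=["year","statecd","distcd"], how="outer", validate="one_to_one", indicator=True)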

## Exercise 3 - Data Cleaning 

1.  **overall_lit** is our target variable. Delete rows with a missing overall_lit value.

   Hint: Refer to the link [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).


2.  Convert categorical values to numerical values.

  For example, if a feature contains categorical values such as dog, cat, mouse, etc., then replace them with 1, 2, 3, etc., or use [Sklearn LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

3. Replace the missing values in any other column appropriately with mean / median / mode.

  Hint: Use pandas [fillna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) function to replace the missing values
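
The three cleaning steps can be tried out on a toy frame first, as in the sketch below. The frame `toy`, the `area` column, and all values are made up; filling with the mean is just one of the options listed above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with a missing target and a missing feature value.
toy = pd.DataFrame({
    "overall_lit": ["High", "Low", None, "Medium"],
    "area": ["urban", "rural", "rural", "urban"],
    "totschools": [120.0, np.nan, 80.0, 100.0],
})

# 1. Drop rows whose target is missing.
toy = toy.dropna(subset=["overall_lit"]).copy()

# 2. Encode a categorical feature as integers (classes are sorted alphabetically).
toy["area"] = LabelEncoder().fit_transform(toy["area"])

# 3. Fill remaining gaps in numeric columns with each column's mean.
toy = toy.fillna(toy.mean(numeric_only=True))
print(toy)
```

Passing `numeric_only=True` keeps the mean from failing on text columns such as the target.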




In [None]:
df = df[df["overall_lit"].notna()]
df.shape

(1268, 198)

In [None]:
labelEncoder = {'High': 1, 'Medium': 2, 'Low': 3}
df["overall_lit"] = df["overall_lit"].map(labelEncoder)

In [None]:
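# Fill missing values with each numeric column's mean; numeric_only=True skips text columns
df.fillna(df.mean(numeric_only=True), inplace=True)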

## Exercise 4 

1. Remove the unnecessary columns that do not contribute to the overall literacy rate.

2. Verify if there are any duplicate columns and remove them.

  For example: state name and district name carry the same information as state code and district code.

3. Make sure that the final dataframe has no null or nan values. Delete the rows with missing values.

   Hint: Verify with df.isna() for nan values in the dataframe. 
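
A minimal sketch of these checks on a made-up frame follows. It only detects columns whose values are exactly equal; pairs such as state name and state code hold the same information in different forms and still have to be spotted by inspection. The frame `toy` and its columns `state_id` and `const` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "state_id" repeats the information in "statecd".
toy = pd.DataFrame({
    "statecd": [28, 28, 29, 29],
    "state_id": [28, 28, 29, 29],
    "const": [1, 1, 1, 1],
    "villages": [10.0, np.nan, 30.0, 40.0],
})

# Columns with a single distinct value cannot help a classifier.
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]

# Columns whose values exactly repeat an earlier column.
duplicate_cols = toy.columns[toy.T.duplicated().to_numpy()].tolist()

cleaned = toy.drop(columns=constant_cols + duplicate_cols)

# Drop any row that still has a missing value, then confirm none remain.
cleaned = cleaned.dropna()
print(cleaned.isna().any().any())
```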

In [None]:
# Count distinct non-null values per column; a count of 1 marks a constant column
df_unique_cnt = df.nunique()
df_unique_cnt[df_unique_cnt == 1]

df.drop(columns=['tch_nr_p6'], inplace=True)

In [None]:
df.drop(columns=['statename_x', 'statename_y', 'distname_x', 'distname_y', '_merge'], inplace=True)

In [None]:
df.isna().any().any()

False

In [None]:
df.set_index(["year","statecd","distcd"], inplace=True)

## Exercise 5 - Apply Correlation Matrix

Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. Having more features does not necessarily improve accuracy: irrelevant or redundant features add noise to the model and can reduce it.

*Highly correlated features carry largely the same information, so remove them.*

**Function Description:**

Create a function `remove_Highly_Correlated()` that removes highly correlated features from the dataframe. It:
- Creates a correlation matrix of the features
- Keeps only one triangle of the correlation matrix, since the values below and above the diagonal are identical
- Removes columns whose correlation value exceeds the threshold value.
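
To make the triangle idea concrete, here is a small sketch on three invented columns, where `b` is an exact multiple of `a`. It keeps the upper triangle; keeping the lower triangle instead removes the same number of columns but drops the other member of each correlated pair. The frame `toy` and the name `threshold` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features: "b" is an exact multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr()

# Keep only entries strictly above the diagonal; everything else becomes NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped when it correlates above the threshold with an earlier column.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Comparing `upper.abs()` against the threshold would also catch strongly negative correlations, if that is wanted.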

In [None]:
def remove_Highly_Correlated(df, bar=0.9):
  # Creates correlation matrix
  corr = df.corr()
  
  print(corr)
  # Set Up Mask To Hide Upper Triangle
  mask = np.triu(np.ones_like(corr, dtype=bool))
  tri_df = corr.mask(mask)

  # Finding features with correlation value more than specified threshold value (bar=0.9)
  highly_cor_col = [col for col in tri_df.columns if any(tri_df[col] > bar )]
  print("length of highly correlated columns",len(highly_cor_col))
    
  print(tri_df)

  # Drop the highly correlated columns
  reduced_df = df.drop(highly_cor_col, axis = 1)
  print("shape of total data",df.shape,"shape of reduced data",reduced_df.shape)
  return reduced_df

In [None]:
df_Correlated = df.drop(columns=['overall_lit'])
df_reduced = remove_Highly_Correlated(df_Correlated)

                  blocks  clusters  villages  totschools  p_06_pop  p_urb_pop  \
blocks          1.000000  0.719836  0.480601    0.647941  0.501250   0.017008   
clusters        0.719836  1.000000  0.595567    0.757229  0.541815  -0.029617   
villages        0.480601  0.595567  1.000000    0.829773  0.615654  -0.182677   
totschools      0.647941  0.757229  0.829773    1.000000  0.719811  -0.017408   
p_06_pop        0.501250  0.541815  0.615654    0.719811  1.000000   0.108293   
...                  ...       ...       ...         ...       ...        ...   
trn_tch_f7      0.341426  0.257034  0.122937    0.217627  0.161162   0.139115   
prof_trn_tch_r  0.541519  0.673048  0.374279    0.627701  0.593879   0.341628   
prof_trn_tch_p  0.248884  0.222227  0.342820    0.286879  0.247422  -0.004178   
days_nontch    -0.114235 -0.035032  0.003515    0.003734  0.008506  -0.050434   
tch_nontch      0.259996  0.201848  0.332028    0.362174  0.294288   0.017335   

                sexratio  s

## Exercise 6

Perform Mean Correction and Standard Scaling on the data feature/column wise.

**Hint:** To understand the idea behind the terms used above, you may refer to the following link:

[StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
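
The sketch below compares a manual mean correction and scaling with `StandardScaler` on invented values. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas the pandas `std()` default is `ddof=1`, so the manual version sets `ddof=0` explicitly to match. The frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales.
toy = pd.DataFrame({
    "villages": [10.0, 20.0, 30.0, 40.0],
    "p_urb_pop": [0.1, 0.4, 0.2, 0.3],
})

# Manual version: subtract the column mean, divide by the population std (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

# StandardScaler performs the same computation column by column.
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), scaled))
```

With the pandas default `ddof=1` the scaled values differ only by a constant factor per column, which rarely matters for the classifiers used here.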

In [None]:
# Standardize each column: subtract its mean and divide by its standard deviation
df_mean = df_reduced.mean()
df_std = df_reduced.std()

df_std_scaling = (df_reduced - df_mean) / df_std


In [None]:
df = pd.concat([df_std_scaling, df['overall_lit']], axis=1)

## Exercise 7

Apply different classifiers on the preprocessed data and figure out which classifier gives the best result.

* Split the data into train and test

* Fit the model with train data and find its accuracy on test data

### Expected Accuracy is above 90%
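
One convenient pattern for comparing classifiers is to fit them in a loop on the same split, as sketched below. The data is synthetic, generated with `make_classification`, so the scores say nothing about the district dataset. The names `models`, `scores`, and `best`, and the setting `max_iter=1000` (a common way to avoid a logistic regression convergence warning) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the preprocessed district features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

best = max(scores, key=scores.get)
print(scores, best)
```

Fixing `random_state` on the split and on the tree makes the comparison repeatable from run to run.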

In [None]:
features = df.iloc[:,:-1].values
labels = df.iloc[:,-1].values
print(features.shape)
print(labels.shape)

(1268, 163)
(1268,)


In [None]:
# Import the classifiers, the accuracy metric, and the train/test split helper
import os
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state = 42, test_size = 0.2)

Implement Decision Tree on the data

In [None]:
tree_clf = tree.DecisionTreeClassifier(criterion='entropy')

In [None]:
# Fitting the data
tree_clf = tree_clf.fit(X_train,y_train)

In [None]:
# Predict the labels for test data
pred = tree_clf.predict(X_test)

In [None]:
# Calculating accuracy
accuracy_score(y_test, pred)

0.937007874015748

Implement KNN Classifier on the data

In [None]:
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train,y_train)


KNeighborsClassifier(n_neighbors=3)

In [None]:
# Predict the labels for test data
y_pred = neigh.predict(X_test)

In [None]:
# Calculating accuracy
accuracy_score(y_test, y_pred)

0.7519685039370079

Implement Linear Classifier on the data

In [None]:
linear_clf = LogisticRegression(random_state=0)
linear_clf.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(random_state=0)

In [None]:
# Predict the labels for test data
pred = linear_clf.predict(X_test)

In [None]:
# Calculating accuracy
accuracy_score(y_test, pred)

0.8779527559055118