<a href="https://colab.research.google.com/github/hemanthreddy3741/hemanthreddy3741/blob/master/Logistic_regression_on_donors_choose.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DonorsChoose

<p>
DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.
</p>
<p>
    Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
<ul>
<li>
    How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible</li>
    <li>How to increase the consistency of project vetting across different volunteers to improve the experience for teachers</li>
    <li>How to focus volunteer time on the applications that need the most assistance</li>
    </ul>
</p>    
<p>
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
</p>

## About the DonorsChoose Data Set

The `train.csv` data set provided by DonorsChoose contains the following features:

Feature | Description 
----------|---------------
**`project_id`** | A unique identifier for the proposed project. **Example:** `p036502`   
**`project_title`**    | Title of the project. **Examples:**<br><ul><li><code>Art Will Make You Happy!</code></li><li><code>First Grade Fun</code></li></ul> 
**`project_grade_category`** | Grade level of students for which the project is targeted. One of the following enumerated values: <br/><ul><li><code>Grades PreK-2</code></li><li><code>Grades 3-5</code></li><li><code>Grades 6-8</code></li><li><code>Grades 9-12</code></li></ul>  
**`project_subject_categories`** | One or more (comma-separated) subject categories for the project from the following enumerated list of values:  <br/><ul><li><code>Applied Learning</code></li><li><code>Care &amp; Hunger</code></li><li><code>Health &amp; Sports</code></li><li><code>History &amp; Civics</code></li><li><code>Literacy &amp; Language</code></li><li><code>Math &amp; Science</code></li><li><code>Music &amp; The Arts</code></li><li><code>Special Needs</code></li><li><code>Warmth</code></li></ul><br/> **Examples:** <br/><ul><li><code>Music &amp; The Arts</code></li><li><code>Literacy &amp; Language, Math &amp; Science</code></li></ul>  
  **`school_state`** | State where school is located ([Two-letter U.S. postal code](https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations#Postal_codes)). **Example:** `WY`
**`project_subject_subcategories`** | One or more (comma-separated) subject subcategories for the project. **Examples:** <br/><ul><li><code>Literacy</code></li><li><code>Literature &amp; Writing, Social Sciences</code></li></ul> 
**`project_resource_summary`** | An explanation of the resources needed for the project. **Example:** <br/><ul><li><code>My students need hands on literacy materials to manage sensory needs!</code></li></ul> 
**`project_essay_1`**    | First application essay<sup>*</sup>  
**`project_essay_2`**    | Second application essay<sup>*</sup> 
**`project_essay_3`**    | Third application essay<sup>*</sup> 
**`project_essay_4`**    | Fourth application essay<sup>*</sup> 
**`project_submitted_datetime`** | Datetime when project application was submitted. **Example:** `2016-04-28 12:43:56.245`   
**`teacher_id`** | A unique identifier for the teacher of the proposed project. **Example:** `bdf8baa8fedef6bfeec7ae4ff1c15c56`  
**`teacher_prefix`** | Teacher's title. One of the following enumerated values: <br/><ul><li><code>nan</code></li><li><code>Dr.</code></li><li><code>Mr.</code></li><li><code>Mrs.</code></li><li><code>Ms.</code></li><li><code>Teacher.</code></li></ul>  
**`teacher_number_of_previously_posted_projects`** | Number of project applications previously submitted by the same teacher. **Example:** `2` 

<sup>*</sup> See the section <b>Notes on the Essay Data</b> for more details about these features.

Additionally, the `resources.csv` data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:

Feature | Description 
----------|---------------
**`id`** | A `project_id` value from the `train.csv` file.  **Example:** `p036502`   
**`description`** | Description of the resource. **Example:** `Tenor Saxophone Reeds, Box of 25`   
**`quantity`** | Quantity of the resource required. **Example:** `3`   
**`price`** | Price of the resource required. **Example:** `9.95`   

**Note:** Many projects require multiple resources. The `id` value corresponds to a `project_id` in `train.csv`, so you can use it as a key to retrieve all resources needed for a project.
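
A minimal sketch of that lookup, assuming the column names listed in the tables above. The two small frames are made-up stand-ins for `train.csv` and `resources.csv`, not real rows:

```python
import pandas as pd

# Hypothetical miniature versions of the two files; values are illustrative only.
projects = pd.DataFrame({"project_id": ["p000001", "p000002"]})
resources = pd.DataFrame({
    "id": ["p000001", "p000001", "p000002"],
    "description": ["Reeds, box of 25", "Music stand", "Picture books"],
    "quantity": [3, 1, 10],
    "price": [9.95, 20.00, 4.50],
})

# All resource rows that belong to one project.
one_project = resources[resources["id"] == "p000001"]

# One row per project: summed unit prices and summed quantities.
totals = resources.groupby("id").agg({"price": "sum", "quantity": "sum"}).reset_index()

# Attach the totals to the project table; the key is named differently on each side.
merged = projects.merge(totals, left_on="project_id", right_on="id", how="left")
print(merged)
```

Summing `price` this way adds up unit prices and ignores `quantity`; a total cost per project would need `price * quantity` before aggregating.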

The data set contains the following label (the value you will attempt to predict):

Label | Description
----------|---------------
`project_is_approved` | A binary flag indicating whether DonorsChoose approved the project. A value of `0` indicates the project was not approved, and a value of `1` indicates the project was approved.

### Notes on the Essay Data

<ul>
Prior to May 17, 2016, the prompts for the essays were as follows:
<li>__project_essay_1:__ "Introduce us to your classroom"</li>
<li>__project_essay_2:__ "Tell us more about your students"</li>
<li>__project_essay_3:__ "Describe how your students will use the materials you're requesting"</li>
<li>__project_essay_4:__ "Close by sharing why your project will make a difference"</li>
</ul>


<ul>
Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:<br>
<li>__project_essay_1:__ "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."</li>
<li>__project_essay_2:__ "About your project: How will these materials make a difference in your students' learning and improve their school lives?"</li>
<br>For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be NaN.
</ul>
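
One way to cope with the changing essay layout is to treat missing essays as empty strings before joining them. The sketch below assumes the four essay columns described above; the sample rows are invented for illustration, and this is only one possible approach:

```python
import numpy as np
import pandas as pd

essay_cols = ["project_essay_1", "project_essay_2", "project_essay_3", "project_essay_4"]

# Two invented rows: one in the older four-essay layout, one in the newer two-essay layout.
df = pd.DataFrame({
    "project_essay_1": ["Intro.", "My students."],
    "project_essay_2": ["Students.", "My project."],
    "project_essay_3": ["Materials.", np.nan],
    "project_essay_4": ["Impact.", np.nan],
})

# Replace NaN with "" so that missing essays do not become the literal text "nan".
df["essay"] = df[essay_cols].fillna("").apply(" ".join, axis=1).str.strip()
print(df["essay"].tolist())
```

Casting the columns with `str` instead would turn each missing essay into the text `nan`, which then has to be cleaned out of the combined essay afterwards.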


In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import os
import re
import sqlite3
import string
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import auc, confusion_matrix, roc_curve
from tqdm import tqdm

## Reading Data

In [None]:
project_data = pd.read_csv('train_data.csv',nrows=70000)
resource_data = pd.read_csv('resources.csv')

In [None]:
project_data.isnull().sum()

In [None]:
project_data.dropna(subset = ['teacher_prefix'], inplace=True)

In [None]:
project_data.isnull().sum()

In [None]:
y = project_data['project_is_approved'].values
project_data.drop(['project_is_approved'], axis=1, inplace=True)

In [None]:
print("Number of data points in entire data", project_data.shape)
print('-'*50)
print("The attributes of data :", project_data.columns.values)

In [None]:
print("Number of data points in entire data", resource_data.shape)
print(resource_data.columns.values)
resource_data.head(2)

## Preprocessing of `project_subject_categories`

In [None]:
catogories = list(project_data['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in catogories:
    temp = ""
    # consider text like "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # splits it into three parts: ["Math & Science", " Warmth", " Care & Hunger"]
        if 'The' in j.split(): # split the category on whitespace: "Math & Science" => "Math", "&", "Science"
            j=j.replace('The','') # if the word "The" is present, remove it
        j = j.replace(' ','') # remove all spaces, e.g. "Math & Science" => "Math&Science"
        temp+=j.strip()+" " # " abc ".strip() returns "abc"; append the cleaned category followed by a space
        temp = temp.replace('&','_') # replace '&' with '_', e.g. "Math&Science" => "Math_Science"
    cat_list.append(temp.strip())
    
project_data['clean_categories'] = cat_list
project_data.drop(['project_subject_categories'], axis=1, inplace=True)

from collections import Counter
my_counter = Counter()
for word in project_data['clean_categories'].values:
    my_counter.update(word.split())

cat_dict = dict(my_counter)
sorted_cat_dict = dict(sorted(cat_dict.items(), key=lambda kv: kv[1]))
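
The cleaning loop above can also be read as a per-string function. The sketch below is a hedged restatement of the same steps; `clean_category` is a hypothetical helper name, not part of the notebook:

```python
def clean_category(raw):
    """Turn e.g. 'Music & The Arts' into 'Music_Arts' (one token per category)."""
    tokens = []
    for part in raw.split(","):
        if "The" in part.split():        # drop the standalone word "The"
            part = part.replace("The", "")
        part = part.replace(" ", "")     # remove all spaces
        part = part.replace("&", "_")    # join the two halves with an underscore
        tokens.append(part)
    return " ".join(tokens).strip()

print(clean_category("Music & The Arts"))                     # Music_Arts
print(clean_category("Literacy & Language, Math & Science"))  # Literacy_Language Math_Science
```

Each category becomes a single whitespace-free token, which is what lets a word-level vectorizer treat it as one feature later on.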

##  Preprocessing of `project_subject_subcategories`

In [None]:
sub_catogories = list(project_data['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

sub_cat_list = []
for i in sub_catogories:
    temp = ""
    # consider text like "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # splits it into three parts: ["Math & Science", " Warmth", " Care & Hunger"]
        if 'The' in j.split(): # split the subcategory on whitespace: "Math & Science" => "Math", "&", "Science"
            j=j.replace('The','') # if the word "The" is present, remove it
        j = j.replace(' ','') # remove all spaces, e.g. "Math & Science" => "Math&Science"
        temp +=j.strip()+" " # " abc ".strip() returns "abc"; append the cleaned subcategory followed by a space
        temp = temp.replace('&','_') # replace '&' with '_', e.g. "Math&Science" => "Math_Science"
    sub_cat_list.append(temp.strip())

project_data['clean_subcategories'] = sub_cat_list
project_data.drop(['project_subject_subcategories'], axis=1, inplace=True)

# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
my_counter = Counter()
for word in project_data['clean_subcategories'].values:
    my_counter.update(word.split())
    
sub_cat_dict = dict(my_counter)
sorted_sub_cat_dict = dict(sorted(sub_cat_dict.items(), key=lambda kv: kv[1]))

##  Text preprocessing

In [None]:
# merge the four essay columns into a single text column
project_data["essay"] = project_data["project_essay_1"].map(str) +\
                        project_data["project_essay_2"].map(str) + \
                        project_data["project_essay_3"].map(str) + \
                        project_data["project_essay_4"].map(str)

In [None]:
# Drop the four original essay columns now that they are merged into `essay`
project_data.drop(['project_essay_1', 'project_essay_2', 'project_essay_3', 'project_essay_4'], axis=1, inplace=True)

In [None]:
project_data.head(2)

In [None]:
# printing a few sample essays
print(project_data['essay'].values[0])
print("="*50)
print(project_data['essay'].values[150])
print("="*50)
print(project_data['essay'].values[1000])
print("="*50)
print(project_data['essay'].values[20000])
print("="*50)
print(project_data['essay'].values[29500])
print("="*50)

In [None]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    phrase = re.sub(r"%", " percent", phrase)
    phrase = re.sub("nannan",' ', phrase) # "nannan" appears when the NaN values of project_essay_3 and project_essay_4 are cast to str and concatenated
    return phrase

In [None]:
sent = decontracted(project_data['essay'].values[20000])
print(sent)
print("="*50)

In [None]:
# \r \n \t remove from string python: http://texthandler.com/info/remove-line-breaks-python/
sent = sent.replace('\\r', ' ')
sent = sent.replace('\\"', ' ')
sent = sent.replace('\\n', ' ')
print(sent)

In [None]:
# remove special characters: https://stackoverflow.com/a/5843547/4084039
sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
print(sent)

In [None]:
project_data.shape

In [None]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]

In [None]:
# Combining all the above snippets 
from tqdm import tqdm
preprocessed_essays = []
# tqdm is for printing the status bar
for sentence in tqdm(project_data['essay'].values):
    sent = decontracted(sentence)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    preprocessed_essays.append(sent.lower().strip())

In [None]:
# after preprocessing
preprocessed_essays[20000]

<h2><font color='black'>  Preprocessing of `project_title`</font></h2>

In [None]:
# similarly preprocessing the titles also
preprocessed_title = []
# tqdm is for printing the status bar
for sentance in tqdm(project_data['project_title'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords) # lower-case before the lookup, as done for the essays
    preprocessed_title.append(sent.lower().strip())

In [None]:
project_data['project_title'] = preprocessed_title

In [None]:
# Remove '.' from teacher_prefix (part of text preprocessing)
project_data['teacher_prefix']=project_data['teacher_prefix'].str.replace(r'\.','',regex=True).astype(str)

In [None]:
project_data['teacher_prefix'].isna().any()

In [None]:
project_data['teacher_prefix'].value_counts()

In [None]:
# https://stackoverflow.com/questions/22407798/how-to-reset-a-dataframes-indexes-for-all-groups-in-one-step
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
price_data.head(2)

In [None]:
# join two dataframes in python: 
project_data = pd.merge(project_data, price_data, on='id', how='left')

In [None]:
project_data['price'].isnull().any()

In [None]:
# Replace whitespace and '-' in project_grade_category with '_' (part of text preprocessing)
project_data['project_grade_category'] = project_data['project_grade_category'].str.replace(r'\s+', '_', regex=True)
project_data['project_grade_category'] = project_data['project_grade_category'].str.replace('-', '_', regex=False)

In [None]:
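import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Requires the VADER lexicon: nltk.download('vader_lexicon')
m = []
sid = SentimentIntensityAnalyzer()  # build the analyzer once instead of once per essay

def senti(i):
    # polarity_scores returns a dict with the keys 'neg', 'neu', 'pos', 'compound'
    ss = sid.polarity_scores(i)
    return [ss[k] for k in ('neg', 'neu', 'pos', 'compound')]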

In [None]:
# assign the list directly; it is in the same row order as project_data
project_data['text'] = preprocessed_essays

In [None]:
# swifter is a third-party package that speeds up pandas apply
import swifter
df2 = project_data['text'].swifter.apply(lambda x : senti(x))

In [None]:
senti_score_essay = pd.DataFrame(df2.values.tolist(), columns=['neg','neu','pos','compound'])

In [None]:
#https://stackoverflow.com/questions/23891575/how-to-merge-two-dataframes-side-by-side
project_data = pd.concat([project_data, senti_score_essay], axis=1)

In [None]:
#https://stackoverflow.com/questions/34962104/pandas-how-can-i-use-the-apply-function-for-a-single-column
# note: len(x) counts characters; len(x.split()) would count words instead
a=project_data['project_title'].apply(lambda x : len(x))

In [None]:
project_data['now_title'] = pd.DataFrame(a)

In [None]:
#https://stackoverflow.com/questions/34962104/pandas-how-can-i-use-the-apply-function-for-a-single-column
# note: len(x) counts characters; len(x.split()) would count words instead
b=project_data['text'].apply(lambda x : len(x))

In [None]:
project_data['now_text'] = pd.DataFrame(b)

##  Preparing data for models

In [None]:
project_data.columns

We are going to consider the following features:

       - school_state : categorical data
       - clean_categories : categorical data
       - clean_subcategories : categorical data
       - project_grade_category : categorical data
       - teacher_prefix : categorical data
       
       - project_title : text data
       - text : text data
       - project_resource_summary: text data (optional)
       
       - quantity : numerical (optional)
       - teacher_number_of_previously_posted_projects : numerical
       - price : numerical

In [None]:
final_features = ['school_state', 'clean_categories', 'clean_subcategories', 'text', 'project_grade_category', 'teacher_prefix', 'project_title', 'teacher_number_of_previously_posted_projects', 'price','quantity','now_title','now_text','neg','neu','pos','compound']

In [None]:
project_data1 = project_data[final_features].copy()

In [None]:
project_data1.columns

In [None]:
X = project_data1.copy()

In [None]:
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=42)
X_train,X_cv,y_train,y_cv=train_test_split(X_train,y_train,test_size=0.33,stratify=y_train, random_state=42)
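
Because the second split is taken from the training part of the first, the three sets do not end up as 67/33 of the whole. The arithmetic can be sketched as follows; the row count is a hypothetical figure for illustration, and the exact sizes also depend on how scikit-learn rounds:

```python
test_size = 0.33  # the same value is used for both splits above

test_frac = test_size
cv_frac = (1 - test_size) * test_size
train_frac = (1 - test_size) * (1 - test_size)

n_rows = 70000  # hypothetical row count, for illustration only
for name, frac in [("train", train_frac), ("cv", cv_frac), ("test", test_frac)]:
    print(f"{name}: {frac:.4f} of the data, about {round(frac * n_rows)} rows")
```

So roughly 45% of the rows are used for training, 22% for cross-validation, and 33% for testing.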

In [None]:
print("Shape of X_train",X_train.shape)
print("Shape of y_train",y_train.shape)
print('\n')
print("Shape of X_cv",X_cv.shape)
print("Shape of y_cv",y_cv.shape)
print('\n')
print("Shape of X_test",X_test.shape)
print("Shape of y_test",y_test.shape)

###  Vectorizing Categorical data

<h3> Encoding categorical features: school_state</h3>

In [None]:
# we use CountVectorizer to convert the values into one-hot encoded vectors
vectorizer = CountVectorizer()
vectorizer.fit(X_train['school_state'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_state_ohe = vectorizer.transform(X_train['school_state'].values)
X_cv_state_ohe = vectorizer.transform(X_cv['school_state'].values)
X_test_state_ohe = vectorizer.transform(X_test['school_state'].values)

print("After vectorizations")
print(X_train_state_ohe.shape, y_train.shape)
print(X_cv_state_ohe.shape, y_cv.shape)
print(X_test_state_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn releases
print("="*100)
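
To see why fitting on the training split alone matters, here is a small sketch with invented state codes. It assumes the default `CountVectorizer` settings used above:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_states = ["CA", "NY", "TX", "CA"]   # invented training values
test_states = ["NY", "WY"]                # "WY" never appears in training

vectorizer = CountVectorizer()
vectorizer.fit(train_states)              # the vocabulary comes from training data only

X_train_demo = vectorizer.transform(train_states)
X_test_demo = vectorizer.transform(test_states)

print(sorted(vectorizer.vocabulary_))     # tokens are lower-cased by default
print(X_train_demo.toarray())
print(X_test_demo.toarray())              # the unseen "WY" row is all zeros
```

Each single-valued row has exactly one non-zero column, which is what makes the count matrix behave like a one-hot encoding. A value that was never seen during `fit` simply maps to an all-zero row.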

<h3> Encoding categorical features: teacher_prefix</h3>

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(X_train['teacher_prefix'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_teacher_ohe = vectorizer.transform(X_train['teacher_prefix'].values)
X_cv_teacher_ohe = vectorizer.transform(X_cv['teacher_prefix'].values)
X_test_teacher_ohe = vectorizer.transform(X_test['teacher_prefix'].values.astype(str))

print("After vectorizations")
print(X_train_teacher_ohe.shape, y_train.shape)
print(X_cv_teacher_ohe.shape, y_cv.shape)
print(X_test_teacher_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn releases
print("="*100)

<h3> Encoding categorical features: project_grade_category</h3>

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(X_train['project_grade_category'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_grade_ohe = vectorizer.transform(X_train['project_grade_category'].values)
X_cv_grade_ohe = vectorizer.transform(X_cv['project_grade_category'].values)
X_test_grade_ohe = vectorizer.transform(X_test['project_grade_category'].values)

print("After vectorizations")
print(X_train_grade_ohe.shape, y_train.shape)
print(X_cv_grade_ohe.shape, y_cv.shape)
print(X_test_grade_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn releases
print("="*100)

<h3> Encoding categorical features: clean_subcategories</h3>

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(X_train['clean_subcategories'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_subcat_ohe = vectorizer.transform(X_train['clean_subcategories'].values)
X_cv_subcat_ohe = vectorizer.transform(X_cv['clean_subcategories'].values)
X_test_subcat_ohe = vectorizer.transform(X_test['clean_subcategories'].values)

print("After vectorizations")
print(X_train_subcat_ohe.shape, y_train.shape)
print(X_cv_subcat_ohe.shape, y_cv.shape)
print(X_test_subcat_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn releases
print("="*100)

<h3> Encoding categorical features: clean_categories</h3>

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(X_train['clean_categories'].values) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector
X_train_cat_ohe = vectorizer.transform(X_train['clean_categories'].values)
X_cv_cat_ohe = vectorizer.transform(X_cv['clean_categories'].values)
X_test_cat_ohe = vectorizer.transform(X_test['clean_categories'].values)

print("After vectorizations")
print(X_train_cat_ohe.shape, y_train.shape)
print(X_cv_cat_ohe.shape, y_cv.shape)
print(X_test_cat_ohe.shape, y_test.shape)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn releases
print("="*100)

<h3>Encoding numerical features: Price</h3>

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit(X_train['price'].values.reshape(1,-1))

X_train_price_norm = normalizer.transform(X_train['price'].values.reshape(1,-1))
X_cv_price_norm = normalizer.transform(X_cv['price'].values.reshape(1,-1))
X_test_price_norm = normalizer.transform(X_test['price'].values.reshape(1,-1))

X_train_price_norm = X_train_price_norm.reshape(-1,1)
X_cv_price_norm = X_cv_price_norm.reshape(-1,1)
X_test_price_norm = X_test_price_norm.reshape(-1,1)

print("After vectorizations")
print(X_train_price_norm.shape, y_train.shape)
print(X_cv_price_norm.shape, y_cv.shape)
print(X_test_price_norm.shape, y_test.shape)
print("="*100)
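
A note on what this cell computes. `Normalizer` rescales each row to unit norm (L2 by default), so reshaping a column to `(1, -1)` turns the whole column into one row and divides every value by that column's Euclidean norm. Its `fit` step learns nothing, so the train, CV, and test columns are each divided by their own norm. The numpy-only sketch below reproduces that effect on invented values and, as an assumption rather than the notebook's method, shows standardization with training statistics for comparison:

```python
import numpy as np

train_price = np.array([10.0, 20.0, 40.0])   # invented values
test_price = np.array([10.0, 20.0])

def l2_normalize_column(col):
    """Divide a 1-D column by its Euclidean norm, as row-wise L2 normalization
    does once the column has been reshaped to a single row."""
    return col / np.linalg.norm(col)

train_norm = l2_normalize_column(train_price)
test_norm = l2_normalize_column(test_price)

# The same raw price (10.0) maps to different values in train and test,
# because each split is divided by its own norm.
print(train_norm[0], test_norm[0])

# Alternative (an assumption, not what the notebook does): standardize with
# statistics learned from the training split only.
mean, std = train_price.mean(), train_price.std()
train_std = (train_price - mean) / std
test_std = (test_price - mean) / std
print(train_std, test_std)
```

With the second approach a given raw price maps to the same scaled value in every split, which may be easier for a linear model to use.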



<h3>Encoding numerical features: Quantity</h3>

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit(X_train['quantity'].values.reshape(1,-1))

X_train_quan_norm = normalizer.transform(X_train['quantity'].values.reshape(1,-1))
X_cv_quan_norm = normalizer.transform(X_cv['quantity'].values.reshape(1,-1))
X_test_quan_norm = normalizer.transform(X_test['quantity'].values.reshape(1,-1))

X_train_quan_norm = X_train_quan_norm.reshape(-1,1)
X_cv_quan_norm = X_cv_quan_norm.reshape(-1,1)
X_test_quan_norm = X_test_quan_norm.reshape(-1,1)

print("After vectorizations")
print(X_train_quan_norm.shape, y_train.shape)
print(X_cv_quan_norm.shape, y_cv.shape)
print(X_test_quan_norm.shape, y_test.shape)
print("="*100)

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit(X_train['now_title'].values.reshape(1,-1))

X_train_now_title_norm = normalizer.transform(X_train['now_title'].values.reshape(1,-1))
X_cv_now_title_norm = normalizer.transform(X_cv['now_title'].values.reshape(1,-1))
X_test_now_title_norm = normalizer.transform(X_test['now_title'].values.reshape(1,-1))

X_train_now_title_norm = X_train_now_title_norm.reshape(-1,1)
X_cv_now_title_norm = X_cv_now_title_norm.reshape(-1,1)
X_test_now_title_norm = X_test_now_title_norm.reshape(-1,1)

print("After vectorizations")
print(X_train_now_title_norm.shape, y_train.shape)
print(X_cv_now_title_norm.shape, y_cv.shape)
print(X_test_now_title_norm.shape, y_test.shape)
print("="*100)

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit(X_train['now_text'].values.reshape(1,-1))

X_train_now_text_norm = normalizer.transform(X_train['now_text'].values.reshape(1,-1))
X_cv_now_text_norm = normalizer.transform(X_cv['now_text'].values.reshape(1,-1))
X_test_now_text_norm = normalizer.transform(X_test['now_text'].values.reshape(1,-1))

X_train_now_text_norm = X_train_now_text_norm.reshape(-1,1)
X_cv_now_text_norm = X_cv_now_text_norm.reshape(-1,1)
X_test_now_text_norm = X_test_now_text_norm.reshape(-1,1)

print("After vectorizations")
print(X_train_now_text_norm.shape, y_train.shape)
print(X_cv_now_text_norm.shape, y_cv.shape)
print(X_test_now_text_norm.shape, y_test.shape)
print("="*100)

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit(X_train['neg'].values.reshape(1,-1))

X_train_neg_norm = normalizer.transform(X_train['neg'].values.reshape(1,-1))
X_cv_neg_norm = normalizer.transform(X_cv['neg'].values.reshape(1,-1))
X_test_neg_norm = normalizer.transform(X_test['neg'].values.reshape(1,-1))

X_train_neg_norm = X_train_neg_norm.reshape(-1,1)
X_cv_neg_norm = X_cv_neg_norm.reshape(-1,1)
X_test_neg_norm = X_test_neg_norm.reshape(-1,1)

print("After vectorizations")
print(X_train_neg_norm.shape, y_train.shape)
print(X_cv_neg_norm.shape, y_cv.shape)
print(X_test_neg_norm.shape, y_test.shape)
print("="*100)

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit(X_train['pos'].values.reshape(1,-1))

X_train_pos_norm = normalizer.transform(X_train['pos'].values.reshape(1,-1))
X_cv_pos_norm = normalizer.transform(X_cv['pos'].values.reshape(1,-1))
X_test_pos_norm = normalizer.transform(X_test['pos'].values.reshape(1,-1))

X_train_pos_norm = X_train_pos_norm.reshape(-1,1)
X_cv_pos_norm = X_cv_pos_norm.reshape(-1,1)
X_test_pos_norm = X_test_pos_norm.reshape(-1,1)

print("After vectorizations")
print(X_train_pos_norm.shape, y_train.shape)
print(X_cv_pos_norm.shape, y_cv.shape)
print(X_test_pos_norm.shape, y_test.shape)
print("="*100)

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit(X_train['neu'].values.reshape(1,-1))

X_train_neu_norm = normalizer.transform(X_train['neu'].values.reshape(1,-1))
X_cv_neu_norm = normalizer.transform(X_cv['neu'].values.reshape(1,-1))
X_test_neu_norm = normalizer.transform(X_test['neu'].values.reshape(1,-1))

X_train_neu_norm = X_train_neu_norm.reshape(-1,1)
X_cv_neu_norm = X_cv_neu_norm.reshape(-1,1)
X_test_neu_norm = X_test_neu_norm.reshape(-1,1)

print("After vectorizations")
print(X_train_neu_norm.shape, y_train.shape)
print(X_cv_neu_norm.shape, y_cv.shape)
print(X_test_neu_norm.shape, y_test.shape)
print("="*100)

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit(X_train['compound'].values.reshape(1,-1))

X_train_com_norm = normalizer.transform(X_train['compound'].values.reshape(1,-1))
X_cv_com_norm = normalizer.transform(X_cv['compound'].values.reshape(1,-1))
X_test_com_norm = normalizer.transform(X_test['compound'].values.reshape(1,-1))

X_train_com_norm = X_train_com_norm.reshape(-1,1)
X_cv_com_norm = X_cv_com_norm.reshape(-1,1)
X_test_com_norm = X_test_com_norm.reshape(-1,1)

print("After vectorizations")
print(X_train_com_norm.shape, y_train.shape)
print(X_cv_com_norm.shape, y_cv.shape)
print(X_test_com_norm.shape, y_test.shape)
print("="*100)

<h3>Encoding numerical features: teacher_number_of_previously_posted_projects</h3>

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit(X_train['teacher_number_of_previously_posted_projects'].values.reshape(1,-1))

X_train_teacher_number_of_previously_posted_projects_norm = normalizer.transform(X_train['teacher_number_of_previously_posted_projects'].values.reshape(1,-1))
X_cv_teacher_number_of_previously_posted_projects_norm = normalizer.transform(X_cv['teacher_number_of_previously_posted_projects'].values.reshape(1,-1))
X_test_teacher_number_of_previously_posted_projects_norm = normalizer.transform(X_test['teacher_number_of_previously_posted_projects'].values.reshape(1,-1))

X_train_teacher_number_of_previously_posted_projects_norm = X_train_teacher_number_of_previously_posted_projects_norm.reshape(-1,1)
X_cv_teacher_number_of_previously_posted_projects_norm = X_cv_teacher_number_of_previously_posted_projects_norm.reshape(-1,1)
X_test_teacher_number_of_previously_posted_projects_norm = X_test_teacher_number_of_previously_posted_projects_norm.reshape(-1,1)

print("After vectorizations")
print(X_train_teacher_number_of_previously_posted_projects_norm.shape, y_train.shape)
print(X_cv_teacher_number_of_previously_posted_projects_norm.shape, y_cv.shape)
print(X_test_teacher_number_of_previously_posted_projects_norm.shape, y_test.shape)
print("="*100)

### Vectorizing Text data

####  Bag of words

In [None]:
vectorizer = CountVectorizer(min_df=10, ngram_range=(2,3), max_features = 5000)
vectorizer.fit(X_train['text'].values) 

# we use the fitted CountVectorizer to convert the text to vector
X_train_text_bow = vectorizer.transform(X_train['text'].values)
X_cv_text_bow = vectorizer.transform(X_cv['text'].values)
X_test_text_bow = vectorizer.transform(X_test['text'].values)

print("After vectorizations")
print(X_train_text_bow.shape, y_train.shape)
print(X_cv_text_bow.shape, y_cv.shape)
print(X_test_text_bow.shape, y_test.shape)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn releases
print("="*100)
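
With `ngram_range=(2,3)` the vocabulary holds only two-word and three-word sequences, with no single words. The sketch below shows what such features look like for one short text. `word_ngrams` is a hypothetical helper written for illustration and it skips scikit-learn's own tokenization rules:

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "students need hands on materials".split()
print(word_ngrams(tokens, 2, 3))
```

Five words yield four bigrams and three trigrams. `min_df=10` then drops n-grams that appear in fewer than 10 training essays, and `max_features=5000` keeps only the most frequent ones.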

In [None]:
vectorizer = CountVectorizer(min_df = 5, ngram_range=(2,2), max_features=5000)
vectorizer.fit(X_train['project_title'].values.astype('U')) # fit has to happen only on train data

# we use the fitted CountVectorizer to convert the text to vector

X_train_title_bow = vectorizer.transform(X_train['project_title'].values.astype('U'))
X_cv_title_bow = vectorizer.transform(X_cv['project_title'].values.astype('U'))
X_test_title_bow = vectorizer.transform(X_test['project_title'].values.astype('U'))

print("After vectorizations")
print(X_train_title_bow.shape, y_train.shape)
print(X_cv_title_bow.shape, y_cv.shape)
print(X_test_title_bow.shape, y_test.shape)
print(vectorizer.get_feature_names())
print("="*100)

####  TFIDF vectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=10)
vectorizer.fit(X_train['text'].values) # fit has to happen only on train data

# we use the fitted tfidfVectorizer to convert the text to vector
X_train_text_tfidf = vectorizer.transform(X_train['text'].values)
X_cv_text_tfidf = vectorizer.transform(X_cv['text'].values)
X_test_text_tfidf = vectorizer.transform(X_test['text'].values)

print("After vectorizations")
print(X_train_text_tfidf.shape, y_train.shape)
print(X_cv_text_tfidf.shape, y_cv.shape)
print(X_test_text_tfidf.shape, y_test.shape)
print(vectorizer.get_feature_names())
print("="*100)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=10)
vectorizer.fit(X_train['project_title'].values.astype('U')) # fit has to happen only on train data

# we use the fitted tfidfVectorizer to convert the text to vector
X_train_title_tfidf = vectorizer.transform(X_train['project_title'].values.astype('U'))
X_cv_title_tfidf = vectorizer.transform(X_cv['project_title'].values.astype('U'))
X_test_title_tfidf = vectorizer.transform(X_test['project_title'].values.astype('U'))

print("After vectorizations")
print(X_train_title_tfidf.shape, y_train.shape)
print(X_cv_title_tfidf.shape, y_cv.shape)
print(X_test_title_tfidf.shape, y_test.shape)
print(vectorizer.get_feature_names())
print("="*100)

### Using gensim for doing word2vec

### Applying word2vec on project title

In [None]:
list_of_sentance_train=[]
for sentance in (X_train['project_title'].values):
    list_of_sentance_train.append(sentance.split())
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
w2v_model=Word2Vec(list_of_sentance_train,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occurred at least 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

In [None]:
from tqdm import tqdm
import numpy as np
sent_vectors_train = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance_train): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50; change this to 300 if you use Google's pretrained w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_train.append(sent_vec)
X_train_title_avgw2v = np.array(sent_vectors_train)
print(X_train_title_avgw2v.shape)
print(X_train_title_avgw2v[0])

In [None]:
list_of_sentance_train=[]
for sentance in (X_cv['project_title'].values):
    list_of_sentance_train.append(sentance.split())
from tqdm import tqdm
import numpy as np
sent_vectors_train = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance_train): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50; change this to 300 if you use Google's pretrained w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_train.append(sent_vec)
X_cv_title_avgw2v = np.array(sent_vectors_train)
print(X_cv_title_avgw2v.shape)
print(X_cv_title_avgw2v[0])

In [None]:
list_of_sentance_train=[]
for sentance in (X_test['project_title'].values):
    list_of_sentance_train.append(sentance.split())
from tqdm import tqdm
import numpy as np
sent_vectors_train = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance_train): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50; change this to 300 if you use Google's pretrained w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_train.append(sent_vec)
X_test_title_avgw2v = np.array(sent_vectors_train)
print(X_test_title_avgw2v.shape)
print(X_test_title_avgw2v[0])

### Applying avg word2vec on project_essay

In [None]:
list_of_sentance_train_essay=[]
for sentance in (X_train['text'].values):
    list_of_sentance_train_essay.append(sentance.split())
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
w2v_model1=Word2Vec(list_of_sentance_train_essay,min_count=5,size=50, workers=4)
w2v_words1 = list(w2v_model1.wv.vocab)
print("number of words that occurred at least 5 times ",len(w2v_words1))
print("sample words ", w2v_words1[0:50])

In [None]:
from tqdm import tqdm
import numpy as np
sent_vectors_train = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance_train_essay): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50; change this to 300 if you use Google's pretrained w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words1:
            vec = w2v_model1.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_train.append(sent_vec)
X_train_text_avgw2v = np.array(sent_vectors_train)
print(X_train_text_avgw2v.shape)
print(X_train_text_avgw2v[0])

In [None]:
list_of_sentance_train_essay=[]
for sentance in (X_cv['text'].values):
    list_of_sentance_train_essay.append(sentance.split())
from tqdm import tqdm
import numpy as np
sent_vectors_train = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance_train_essay): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50; change this to 300 if you use Google's pretrained w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words1:
            vec = w2v_model1.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_train.append(sent_vec)
X_cv_text_avgw2v = np.array(sent_vectors_train)
print(X_cv_text_avgw2v.shape)
print(X_cv_text_avgw2v[0])

In [None]:
list_of_sentance_train_essay=[]
for sentance in (X_test['text'].values):
    list_of_sentance_train_essay.append(sentance.split())
from tqdm import tqdm
import numpy as np
sent_vectors_train = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance_train_essay): # for each review/sentence
    sent_vec = np.zeros(50) # word vectors have length 50; change this to 300 if you use Google's pretrained w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words1:
            vec = w2v_model1.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_train.append(sent_vec)
X_test_text_avgw2v = np.array(sent_vectors_train)
print(X_test_text_avgw2v.shape)
print(X_test_text_avgw2v[0])

### Applying tfidf w2v on project_title

In [None]:
tfidf_model1 = TfidfVectorizer()
tfidf_model1.fit(X_train['project_title'].values)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model1.get_feature_names(), list(tfidf_model1.idf_)))
tfidf_words1 = set(tfidf_model1.get_feature_names())

In [None]:
X_train_title_tfidf_w2v = [] # the tfidf-weighted w2v for each title is stored in this list
for sentence in tqdm(X_train['project_title'].values): # for each title
    vector = np.zeros(50) # accumulator; word vectors have length 50
    tf_idf_weight = 0 # running sum of tf-idf weights of the words with a valid vector
    for word in sentence.split(): # for each word in the title
        if (word in w2v_words) and (word in tfidf_words1):
            vec = w2v_model.wv[word] # getting the vector for each word via the KeyedVectors object
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    X_train_title_tfidf_w2v.append(vector)
print(len(X_train_title_tfidf_w2v))
print(len(X_train_title_tfidf_w2v[0]))

In [None]:
X_cv_title_tfidf_w2v = [] # the tfidf-weighted w2v for each title is stored in this list
for sentence in tqdm(X_cv['project_title'].values): # for each title
    vector = np.zeros(50) # accumulator; word vectors have length 50
    tf_idf_weight = 0 # running sum of tf-idf weights of the words with a valid vector
    for word in sentence.split(): # for each word in the title
        if (word in w2v_words) and (word in tfidf_words1):
            vec = w2v_model.wv[word] # getting the vector for each word via the KeyedVectors object
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    X_cv_title_tfidf_w2v.append(vector)
print(len(X_cv_title_tfidf_w2v))
print(len(X_cv_title_tfidf_w2v[0]))

In [None]:
X_test_title_tfidf_w2v = [] # the tfidf-weighted w2v for each title is stored in this list
for sentence in tqdm(X_test['project_title'].values): # for each title
    vector = np.zeros(50) # accumulator; word vectors have length 50
    tf_idf_weight = 0 # running sum of tf-idf weights of the words with a valid vector
    for word in sentence.split(): # for each word in the title
        if (word in w2v_words) and (word in tfidf_words1):
            vec = w2v_model.wv[word] # getting the vector for each word via the KeyedVectors object
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    X_test_title_tfidf_w2v.append(vector)
print(len(X_test_title_tfidf_w2v))
print(len(X_test_title_tfidf_w2v[0]))

### Applying tfidf w2v on project_text

In [None]:
tfidf_model1 = TfidfVectorizer()
tfidf_model1.fit(X_train['text'].values)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model1.get_feature_names(), list(tfidf_model1.idf_)))
tfidf_words1 = set(tfidf_model1.get_feature_names())

In [None]:
X_train_text_tfidf_w2v = [] # the tfidf-weighted w2v for each essay is stored in this list
for sentence in tqdm(X_train['text'].values): # for each essay
    vector = np.zeros(50) # accumulator; word vectors have length 50
    tf_idf_weight = 0 # running sum of tf-idf weights of the words with a valid vector
    for word in sentence.split(): # for each word in the essay
        if (word in w2v_words1) and (word in tfidf_words1):
            vec = w2v_model1.wv[word] # getting the vector for each word via the KeyedVectors object
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    X_train_text_tfidf_w2v.append(vector)
print(len(X_train_text_tfidf_w2v))
print(len(X_train_text_tfidf_w2v[0]))

In [None]:
X_cv_text_tfidf_w2v = [] # the tfidf-weighted w2v for each essay is stored in this list
for sentence in tqdm(X_cv['text'].values): # for each essay
    vector = np.zeros(50) # accumulator; word vectors have length 50
    tf_idf_weight = 0 # running sum of tf-idf weights of the words with a valid vector
    for word in sentence.split(): # for each word in the essay
        if (word in w2v_words1) and (word in tfidf_words1):
            vec = w2v_model1.wv[word] # getting the vector for each word via the KeyedVectors object
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    X_cv_text_tfidf_w2v.append(vector)
print(len(X_cv_text_tfidf_w2v))
print(len(X_cv_text_tfidf_w2v[0]))

In [None]:
X_test_text_tfidf_w2v = [] # the tfidf-weighted w2v for each essay is stored in this list
for sentence in tqdm(X_test['text'].values): # for each essay
    vector = np.zeros(50) # accumulator; word vectors have length 50
    tf_idf_weight = 0 # running sum of tf-idf weights of the words with a valid vector
    for word in sentence.split(): # for each word in the essay
        if (word in w2v_words1) and (word in tfidf_words1):
            vec = w2v_model1.wv[word] # getting the vector for each word via the KeyedVectors object
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    X_test_text_tfidf_w2v.append(vector)
print(len(X_test_text_tfidf_w2v))
print(len(X_test_text_tfidf_w2v[0]))

# Assignment 5 : Logistic regression

<ol>
    <li><strong>[Task-1] Logistic Regression(either SGDClassifier with log loss, or LogisticRegression) on these feature sets</strong>
        <ul>
            <li><font color='red'>Set 1</font>: categorical, numerical features + project_title(BOW) + preprocessed_essay (`BOW with bi-grams` with `min_df=10` and `max_features=5000`)</li>
            <li><font color='red'>Set 2</font>: categorical, numerical features + project_title(TFIDF) + preprocessed_essay (`TFIDF with bi-grams` with `min_df=10` and `max_features=5000`)</li>
            <li><font color='red'>Set 3</font>: categorical, numerical features + project_title(AVG W2V) + preprocessed_essay (AVG W2V)</li>
            <li><font color='red'>Set 4</font>: categorical, numerical features + project_title(TFIDF W2V)+  preprocessed_essay (TFIDF W2V)</li>        </ul>
    </li>
    <br>
    <li><strong>Hyperparameter tuning (find the best hyperparameters for the algorithm that you choose)</strong>
        <ul>
    <li>Find the best hyperparameter, i.e. the one that gives the maximum <a href='https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/receiver-operating-characteristic-curve-roc-curve-and-auc-1/'>AUC</a> value</li>
    <li>Find the best hyperparameter using k-fold cross validation or simple cross validation data</li>
    <li>Use GridSearchCV or RandomizedSearchCV, or write your own for loops to do the hyperparameter tuning</li>
        </ul>
    </li>
    <br>
    <li><strong>Representation of results</strong>
        <ul>
    <li>You need to plot the model's performance on both train data and cross-validation data for each hyperparameter value, as shown in the figure.
    <img src='train_cv_auc.JPG' width=300px></li>
    <li>Once you have found the best hyperparameter, train your model with it, find the AUC on test data, and plot the ROC curve for both train and test.
    <img src='train_test_auc.JPG' width=300px></li>
    <li>Along with plotting the ROC curve, you need to print the <a href='https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/confusion-matrix-tpr-fpr-fnr-tnr-1/'>confusion matrix</a> with predicted and original labels of the test data points. Please visualize your confusion matrices using <a href='https://seaborn.pydata.org/generated/seaborn.heatmap.html'>seaborn heatmaps</a>.
    <img src='confusion_matrix.png' width=300px></li>
        </ul>
    </li>
    <br>
    <li><strong>[Task-2] Apply Logistic Regression on the below feature set <font color='red'> Set 5 </font> by finding the best hyper parameter as suggested in step 2 and step 3.</strong>
    <li> Consider this set of features <font color='red'> Set 5 :</font>
            <ul>
                <li><strong>school_state</strong> : categorical data</li>
                <li><strong>clean_categories</strong> : categorical data</li>
                <li><strong>clean_subcategories</strong> : categorical data</li>
                <li><strong>project_grade_category</strong> :categorical data</li>
                <li><strong>teacher_prefix</strong> : categorical data</li>
                <li><strong>quantity</strong> : numerical data</li>
                <li><strong>teacher_number_of_previously_posted_projects</strong> : numerical data</li>
                <li><strong>price</strong> : numerical data</li>
                <li><strong>sentiment scores of each essay</strong> : numerical data</li>
                <li><strong>number of words in the title</strong> : numerical data</li>
                <li><strong>number of words in the combined essays</strong> : numerical data</li>
            </ul>
        Apply Logistic regression on these features, finding the best hyperparameter as suggested in step 2 and step 3 <br>
    </li>
    <br>
    <li><strong>Conclusion</strong>
        <ul>
    <li>You need to summarize the results at the end of the notebook in a table. To print a table, please refer to the prettytable library <a href='http://zetcode.com/python/prettytable/'>link</a>.
        <img src='summary.JPG' width=400px>
    </li>
        </ul>
</ol>
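The hyperparameter-tuning step above can be sketched with `GridSearchCV` scored by AUC. This is only a minimal illustration on synthetic data: the names `X_demo`, `y_demo`, and the grid values are assumptions made for the example, not the notebook's actual features or final settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical): the label depends mostly on feature 0.
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 5)
y_demo = (X_demo[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

# Candidate values of C, the inverse regularization strength.
param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid,
    scoring="roc_auc",        # pick the C that maximizes cross-validated AUC
    cv=3,                     # 3-fold cross validation
    return_train_score=True,  # keep train AUC too, for the train-vs-CV plot
)
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print("mean CV AUC per C:", search.cv_results_["mean_test_score"])
print("mean train AUC per C:", search.cv_results_["mean_train_score"])
```

The `mean_train_score` and `mean_test_score` arrays can be plotted against the grid values to produce the train-versus-CV AUC figure that the task asks for.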

<h4><font color='red'>Note: Data Leakage</font></h4>

1. There will be an issue of data leakage if you vectorize the entire data and then split it into train/cv/test.
2. To avoid data leakage, make sure to split your data first and then vectorize it.
3. While vectorizing your data, apply the method fit_transform() on your train data, and apply the method transform() on cv/test data.
4. For more details please go through this <a href='https://soundcloud.com/applied-ai-course/leakage-bow-and-tfidf'>link.</a>
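The fit-on-train, transform-on-the-rest pattern from the note above could look like the following sketch. The two tiny corpora are made-up stand-ins for the real train and cv splits.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora standing in for the train / cv splits (hypothetical data).
train_docs = ["students need books", "students need tablets", "books for reading"]
cv_docs = ["students need microscopes", "reading books daily"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training split only.
X_train_demo = vectorizer.fit_transform(train_docs)

# Reuse that vocabulary for the cv split; words never seen in train are ignored.
X_cv_demo = vectorizer.transform(cv_docs)

print(sorted(vectorizer.vocabulary_))
print(X_train_demo.shape, X_cv_demo.shape)
```

Because the vocabulary comes from the train split alone, both matrices have the same number of columns, and nothing about the cv documents influences the features the model is trained on.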

<h1>Logistic regression </h1>

###  Applying Logistic regression on BOW,<font color='red'> SET 1</font>

### Merging all the above features

- We need to merge all the feature vectors, i.e. the categorical, text, and numerical vectors
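As a small illustration of the merge, the sketch below stacks three hypothetical feature blocks column-wise with `scipy.sparse.hstack`. The block names and values are invented for the example; the dense numeric column is wrapped in a sparse matrix first so that every block has the same type.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical feature blocks for 3 rows.
ohe_block = csr_matrix(np.array([[1, 0], [0, 1], [1, 0]]))           # one-hot encoded category
price_block = csr_matrix(np.array([[0.2], [0.5], [0.9]]))            # normalized numeric column
bow_block = csr_matrix(np.array([[0, 2, 1], [1, 0, 0], [0, 0, 3]]))  # bag-of-words counts

# Stack the blocks side by side; the row count stays the same and the columns add up.
merged = hstack((ohe_block, price_block, bow_block)).tocsr()

print(merged.shape)
print(merged.toarray())
```

Every block must have the same number of rows and be in the same row order, otherwise features from different projects end up in the same row.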



In [None]:
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_tr = hstack((X_train_state_ohe,X_train_teacher_ohe,X_train_grade_ohe,X_train_subcat_ohe,
               X_train_cat_ohe,X_train_price_norm,X_train_teacher_number_of_previously_posted_projects_norm,
               X_train_text_bow,X_train_title_bow)).tocsr()

X_cr = hstack((X_cv_state_ohe,X_cv_teacher_ohe,X_cv_grade_ohe,X_cv_subcat_ohe,
               X_cv_cat_ohe,X_cv_price_norm,X_cv_teacher_number_of_previously_posted_projects_norm,
               X_cv_text_bow,X_cv_title_bow)).tocsr()

X_te = hstack((X_test_state_ohe,X_test_teacher_ohe,X_test_grade_ohe,X_test_subcat_ohe,
               X_test_cat_ohe,X_test_price_norm,X_test_teacher_number_of_previously_posted_projects_norm,
               X_test_text_bow,X_test_title_bow)).tocsr()

print("Final Data matrix")
print(X_tr.shape, y_train.shape)
print(X_cr.shape, y_cv.shape)
print(X_te.shape, y_test.shape)
print("="*100)

In [None]:
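def batch_predict(clf,data):
    # predict positive-class probabilities in chunks of 1000 rows to limit memory use
    y_pred=[]
    dl=data.shape[0]-data.shape[0]%1000
    
    for i in range(0,dl,1000):
        y_pred.extend(clf.predict_proba(data[i:i+1000])[:,1])
    
    # handle the remaining rows, if any (predict_proba raises an error on an empty slice)
    if dl<data.shape[0]:
        y_pred.extend(clf.predict_proba(data[dl:])[:,1])
    
    return y_pred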

In [None]:
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

train_auc=[]
cv_auc=[]
alpha= [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]
b=[]
for i in tqdm(alpha):
    mb=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
    mb.fit(X_tr,y_train)
    
    y_train_pred=batch_predict(mb,X_tr)
    y_cv_pred=batch_predict(mb,X_cr)
    
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv,y_cv_pred))
    
max_auc_ind_train=train_auc.index(max(train_auc))
alpha_max_auc_train=alpha[max_auc_ind_train]

print("max auc in train data:",max(train_auc))
print("alpha value for maximum AUC:",alpha_max_auc_train)
max_auc_ind_cv=cv_auc.index(max(cv_auc))
alpha_max_auc_cv=alpha[max_auc_ind_cv]
print("max auc  in cv data:",max(cv_auc))
print("alpha value for maximum AUC:",alpha_max_auc_cv)

plt.plot(alpha,train_auc,label='Train_AUC')
plt.plot(alpha,cv_auc,label='CV_AUC')
plt.scatter(alpha,cv_auc)
plt.scatter(alpha,train_auc)
#https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xscale.html
plt.xscale('log')
plt.title("AUC vs. hyperparameter C")
plt.ylabel('AUC')
plt.xlabel('C: hyperparameter (inverse regularization strength)')
plt.legend()
plt.grid(1)


In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc

nb = LogisticRegression(C=0.01,class_weight='balanced',penalty='l2')
nb.fit(X_tr, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = nb.predict_proba(X_tr)[:,1]    
y_test_pred = nb.predict_proba(X_te)[:,1]

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr )))
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.grid()
plt.show()

In [None]:
# we write our own predict function that uses a chosen threshold:
# we pick the threshold that maximizes tpr*(1-fpr), i.e. a high tpr together with a low fpr
def predict(proba,threshold,fpr,tpr):
    t=threshold[np.argmax(tpr*(1-fpr))]
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high

    print("the maximum value of tpr*(1-fpr)",max(tpr*(1-fpr)))
    print("Threshold:",np.round(t,3))
    prediction=[]
    for i in proba:
        if i>=t:
            prediction.append(1)
        else:
            prediction.append(0)
    
    return prediction
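
The threshold rule used by `predict` can be checked by hand on a few ROC points. The arrays below are invented for illustration and only mimic the shape of what `roc_curve` returns.

```python
import numpy as np

# Hypothetical ROC points, in the order roc_curve would return them.
fpr_demo = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
tpr_demo = np.array([0.0, 0.6, 0.8, 0.9, 1.0])
thr_demo = np.array([1.9, 0.8, 0.5, 0.3, 0.1])

# Score every candidate threshold by tpr * (1 - fpr) and keep the best one.
scores = tpr_demo * (1 - fpr_demo)
best = int(np.argmax(scores))
best_threshold = thr_demo[best]

# Turn probabilities into 0/1 labels with the chosen threshold.
proba_demo = np.array([0.2, 0.5, 0.7, 0.45])
labels = (proba_demo >= best_threshold).astype(int)

print(scores)
print(best_threshold, labels)
```

The threshold, fpr, and tpr arrays must come from the same `roc_curve` call, because the index found on fpr and tpr is used to look up the threshold.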

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix
#https://stackoverflow.com/a/33158941/10967428
con_tr=confusion_matrix(y_train,predict(y_train_pred,tr_thresholds,train_fpr,train_tpr))

sns.heatmap(con_tr,annot=True,fmt='0.00f',annot_kws={'size':10})
plt.title("Train Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()

In [None]:
#https://stackoverflow.com/a/33158941/10967428
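# use the threshold chosen on the train ROC curve; thresholds, fpr and tpr must come from the same curve
con_te=confusion_matrix(y_test, predict(y_test_pred, tr_thresholds, train_fpr, train_tpr))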

sns.heatmap(con_te,annot=True,fmt='0.00f',annot_kws={'size':10})
plt.title("Test Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()

### Applying Logistic regression on TFIDF,<font color='red'> SET 2</font>

In [None]:
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_tr = hstack((X_train_state_ohe,X_train_teacher_ohe,X_train_grade_ohe,X_train_subcat_ohe,
               X_train_cat_ohe,X_train_price_norm,X_train_teacher_number_of_previously_posted_projects_norm,
               X_train_text_tfidf,X_train_title_tfidf)).tocsr()

X_cr = hstack((X_cv_state_ohe,X_cv_teacher_ohe,X_cv_grade_ohe,X_cv_subcat_ohe,
               X_cv_cat_ohe,X_cv_price_norm,X_cv_teacher_number_of_previously_posted_projects_norm,
               X_cv_text_tfidf,X_cv_title_tfidf)).tocsr()

X_te = hstack((X_test_state_ohe,X_test_teacher_ohe,X_test_grade_ohe,X_test_subcat_ohe,
               X_test_cat_ohe,X_test_price_norm,X_test_teacher_number_of_previously_posted_projects_norm,
               X_test_text_tfidf,X_test_title_tfidf)).tocsr()

print("Final Data matrix")
print(X_tr.shape, y_train.shape)
print(X_cr.shape, y_cv.shape)
print(X_te.shape, y_test.shape)
print("="*100)

In [None]:
import matplotlib.pyplot as plt 
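from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

train_auc=[]
cv_auc=[]
alpha= [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]
b=[]
for i in tqdm(alpha):
    # this section tunes logistic regression, so each value is used as C (as in SET 1), not as a Naive Bayes alpha
    mb=LogisticRegression(penalty='l2',C=i,class_weight='balanced')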
    mb.fit(X_tr,y_train)
    
    y_train_pred=batch_predict(mb,X_tr)
    y_cv_pred=batch_predict(mb,X_cr)
    
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv,y_cv_pred))
    
max_auc_ind_train=train_auc.index(max(train_auc))
alpha_max_auc_train=alpha[max_auc_ind_train]

print("max auc in train data:",max(train_auc))
print("alpha value for maximum AUC:",alpha_max_auc_train)
max_auc_ind_cv=cv_auc.index(max(cv_auc))
alpha_max_auc_cv=alpha[max_auc_ind_cv]
print("max auc  in cv data:",max(cv_auc))
print("alpha value for maximum AUC:",alpha_max_auc_cv)

plt.plot(alpha,train_auc,label='Train_AUC')
plt.plot(alpha,cv_auc,label='CV_AUC')
plt.scatter(alpha,cv_auc)
plt.scatter(alpha,train_auc)
#https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xscale.html
plt.xscale('log')
plt.title("AUC vs. hyperparameter")
plt.ylabel('AUC')
plt.xlabel('Alpha:Hyperparameter')
plt.legend()
plt.grid(1)


In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc

nb = LogisticRegression(C=0.1,class_weight='balanced',penalty='l2')
nb.fit(X_tr, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = nb.predict_proba(X_tr)[:,1]    
y_test_pred = nb.predict_proba(X_te)[:,1]

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr )))
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.grid()
plt.show()

In [None]:
import seaborn as sns
#https://stackoverflow.com/a/33158941/10967428
con_tr=confusion_matrix(y_train,predict(y_train_pred,tr_thresholds,train_fpr,train_tpr))

sns.heatmap(con_tr,annot=True,fmt='0.00f',annot_kws={'size':10})
plt.title("Train Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()

In [None]:
import seaborn as sns
#https://stackoverflow.com/a/33158941/10967428
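# use the threshold chosen on the train ROC curve; thresholds, fpr and tpr must come from the same curve
con_te=confusion_matrix(y_test, predict(y_test_pred, tr_thresholds, train_fpr, train_tpr))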

sns.heatmap(con_te,annot=True,fmt='0.00f',annot_kws={'size':10})
plt.title("Test Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()

### Applying Logistic regression on avgw2v,<font color='red'> SET 3</font>

In [None]:
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_tr = hstack((X_train_state_ohe,X_train_teacher_ohe,X_train_grade_ohe,X_train_subcat_ohe,
               X_train_cat_ohe,X_train_price_norm,X_train_teacher_number_of_previously_posted_projects_norm,
               X_train_text_avgw2v,X_train_title_avgw2v)).tocsr()

X_cr = hstack((X_cv_state_ohe,X_cv_teacher_ohe,X_cv_grade_ohe,X_cv_subcat_ohe,
               X_cv_cat_ohe,X_cv_price_norm,X_cv_teacher_number_of_previously_posted_projects_norm,
               X_cv_text_avgw2v,X_cv_title_avgw2v)).tocsr()

X_te = hstack((X_test_state_ohe,X_test_teacher_ohe,X_test_grade_ohe,X_test_subcat_ohe,
               X_test_cat_ohe,X_test_price_norm,X_test_teacher_number_of_previously_posted_projects_norm,
               X_test_text_avgw2v,X_test_title_avgw2v)).tocsr()

print("Final Data matrix")
print(X_tr.shape, y_train.shape)
print(X_cr.shape, y_cv.shape)
print(X_te.shape, y_test.shape)
print("="*100)

In [None]:
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

train_auc=[]
cv_auc=[]
alpha= [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 10**5]
b=[]
for i in tqdm(alpha):
    mb=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
    mb.fit(X_tr,y_train)
    
    y_train_pred=batch_predict(mb,X_tr)
    y_cv_pred=batch_predict(mb,X_cr)
    
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv,y_cv_pred))
    
max_auc_ind_train=train_auc.index(max(train_auc))
alpha_max_auc_train=alpha[max_auc_ind_train]

print("max auc in train data:",max(train_auc))
print("alpha value for maximum AUC:",alpha_max_auc_train)
max_auc_ind_cv=cv_auc.index(max(cv_auc))
alpha_max_auc_cv=alpha[max_auc_ind_cv]
print("max auc  in cv data:",max(cv_auc))
print("alpha value for maximum AUC:",alpha_max_auc_cv)

plt.plot(alpha,train_auc,label='Train_AUC')
plt.plot(alpha,cv_auc,label='CV_AUC')
plt.scatter(alpha,cv_auc)
plt.scatter(alpha,train_auc)
#https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xscale.html
plt.xscale('log')
plt.title("AUC vs. hyperparameter C")
plt.ylabel('AUC')
plt.xlabel('C: hyperparameter (inverse regularization strength)')
plt.legend()
plt.grid(1)

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc

nb = LogisticRegression(C=10**5,class_weight='balanced',penalty='l2')
nb.fit(X_tr, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = nb.predict_proba(X_tr)[:,1]    
y_test_pred = nb.predict_proba(X_te)[:,1]

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr )))
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.grid()
plt.show()

In [None]:
import seaborn as sns
#https://stackoverflow.com/a/33158941/10967428
con_tr=confusion_matrix(y_train,predict(y_train_pred,tr_thresholds,train_fpr,train_tpr))

sns.heatmap(con_tr,annot=True,fmt='d',annot_kws={'size':10})
plt.title("Train Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()

In [None]:
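import seaborn as sns
#https://stackoverflow.com/a/33158941/10967428
# thresholds, FPR and TPR must come from the same roc_curve call;
# the threshold is chosen on the train ROC and then applied to the test scores
con_te=confusion_matrix(y_test, predict(y_test_pred, tr_thresholds, train_fpr, train_tpr))

sns.heatmap(con_te,annot=True,fmt='d',annot_kws={'size':10})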
plt.title("Test Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()
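
The confusion matrices above depend on a `predict` helper defined earlier in the notebook. A minimal sketch of the usual idea, assuming the helper picks the threshold that maximises `tpr * (1 - fpr)` and labels scores at or above it as positive, is shown below. The function names and the toy ROC points are hypothetical.

```python
import numpy as np

def best_threshold(thresholds, fpr, tpr):
    """Return the threshold that maximises tpr * (1 - fpr)."""
    idx = int(np.argmax(tpr * (1 - fpr)))
    return thresholds[idx]

def predict_with_threshold(proba, threshold):
    """Label a sample 1 when its score is at or above the threshold."""
    return (np.asarray(proba) >= threshold).astype(int)

# Toy ROC points (hypothetical values, all three arrays from the same curve)
thresholds = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
fpr = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.2, 0.6, 0.8, 0.9, 1.0])

t = best_threshold(thresholds, fpr, tpr)
labels = predict_with_threshold([0.95, 0.55, 0.45, 0.05], t)
```

Because the index found on `fpr` and `tpr` is used to look up `thresholds`, all three arrays have to come from the same `roc_curve` call.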

### 2.4.2 Applying Logistic regression on TFIDF W2V<font color='red'> SET 4</font>

In [None]:
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_tr = hstack((X_train_state_ohe,X_train_teacher_ohe,X_train_grade_ohe,X_train_subcat_ohe,
               X_train_cat_ohe,X_train_price_norm,X_train_teacher_number_of_previously_posted_projects_norm,
               X_train_text_tfidf_w2v,X_train_title_tfidf_w2v)).tocsr()

X_cr = hstack((X_cv_state_ohe,X_cv_teacher_ohe,X_cv_grade_ohe,X_cv_subcat_ohe,
               X_cv_cat_ohe,X_cv_price_norm,X_cv_teacher_number_of_previously_posted_projects_norm,
               X_cv_text_tfidf_w2v,X_cv_title_tfidf_w2v)).tocsr()

X_te = hstack((X_test_state_ohe,X_test_teacher_ohe,X_test_grade_ohe,X_test_subcat_ohe,
               X_test_cat_ohe,X_test_price_norm,X_test_teacher_number_of_previously_posted_projects_norm,
               X_test_text_tfidf_w2v,X_test_title_tfidf_w2v)).tocsr()

print("Final Data matrix")
print(X_tr.shape, y_train.shape)
print(X_cr.shape, y_cv.shape)
print(X_te.shape, y_test.shape)
print("="*100)
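
As a small illustration of what the `hstack` call does, the sketch below stacks a one-hot block and a single numeric column side by side. The toy matrices are hypothetical stand-ins for the notebook's feature blocks; the only requirement is that every block has the same number of rows.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical one-hot block: 4 samples, 3 categories
ohe = csr_matrix(np.array([[1, 0, 0],
                           [0, 1, 0],
                           [0, 0, 1],
                           [1, 0, 0]]))
# Hypothetical normalized numeric column: 4 samples, 1 feature
price = csr_matrix(np.array([[0.1], [0.5], [0.9], [0.3]]))

# Columns are concatenated; the row count stays the same
X = hstack((ohe, price)).tocsr()
print(X.shape)
```

Converting to CSR with `.tocsr()` makes row slicing efficient, which matters when predictions are computed in batches.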

In [None]:
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from tqdm import tqdm

train_auc=[]
cv_auc=[]
# candidate values for C, the inverse of the regularization strength
alpha= [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 10**5]
for i in tqdm(alpha):
    mb=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
    mb.fit(X_tr,y_train)
    
    y_train_pred=batch_predict(mb,X_tr)
    y_cv_pred=batch_predict(mb,X_cr)
    
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv,y_cv_pred))
    
max_auc_ind_train=train_auc.index(max(train_auc))
alpha_max_auc_train=alpha[max_auc_ind_train]

print("max AUC on train data:",max(train_auc))
print("C value for maximum train AUC:",alpha_max_auc_train)
max_auc_ind_cv=cv_auc.index(max(cv_auc))
alpha_max_auc_cv=alpha[max_auc_ind_cv]
print("max AUC on CV data:",max(cv_auc))
print("C value for maximum CV AUC:",alpha_max_auc_cv)

plt.plot(alpha,train_auc,label='Train_AUC')
plt.plot(alpha,cv_auc,label='CV_AUC')
plt.scatter(alpha,cv_auc)
plt.scatter(alpha,train_auc)
#https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xscale.html
plt.xscale('log')
plt.title("AUC vs C (inverse regularization strength)")
plt.ylabel('AUC')
plt.xlabel('C: hyperparameter (log scale)')
plt.legend()
plt.grid(True)
plt.show()
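
In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so small values regularize heavily and large values barely at all. The following is a self-contained sketch of the same kind of sweep on synthetic data. The data set, the split, and the shorter list of `C` values are assumptions made for illustration and are not the DonorsChoose features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the DonorsChoose features)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

C_values = [1e-4, 1e-2, 1, 1e2, 1e4]
val_auc = []
for C in C_values:
    # L2 penalty is the default; max_iter is raised to help convergence
    model = LogisticRegression(C=C, class_weight='balanced', max_iter=1000)
    model.fit(X_fit, y_fit)
    # AUC needs scores for the positive class, not hard labels
    val_auc.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# Pick the C with the highest validation AUC
best_C = C_values[int(np.argmax(val_auc))]
print("best C:", best_C, "validation AUC:", max(val_auc))
```

Selecting `C` on the validation split rather than the train split avoids rewarding the least-regularized model simply for fitting the training data more closely.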

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc

# use the C that gave the best CV AUC in the sweep above rather than a hard-coded value
nb = LogisticRegression(C=alpha_max_auc_cv,class_weight='balanced',penalty='l2')
nb.fit(X_tr, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = nb.predict_proba(X_tr)[:,1]    
y_test_pred = nb.predict_proba(X_te)[:,1]

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr )))
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.grid()
plt.show()

In [None]:
import seaborn as sns
#https://stackoverflow.com/a/33158941/10967428
con_tr=confusion_matrix(y_train,predict(y_train_pred,tr_thresholds,train_fpr,train_tpr))

sns.heatmap(con_tr,annot=True,fmt='d',annot_kws={'size':10})
plt.title("Train Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()

In [None]:
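import seaborn as sns
#https://stackoverflow.com/a/33158941/10967428
# thresholds, FPR and TPR must come from the same roc_curve call;
# the threshold is chosen on the train ROC and then applied to the test scores
con_te=confusion_matrix(y_test, predict(y_test_pred, tr_thresholds, train_fpr, train_tpr))

sns.heatmap(con_te,annot=True,fmt='d',annot_kws={'size':10})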
plt.title("Test Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()

### Applying Logistic regression on<font color='red'> SET 5</font>

In [None]:
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
X_tr = hstack((X_train_state_ohe,X_train_teacher_ohe,X_train_grade_ohe,X_train_subcat_ohe,
               X_train_cat_ohe,X_train_price_norm,X_train_teacher_number_of_previously_posted_projects_norm,
               X_train_quan_norm, X_train_now_title_norm, X_train_now_text_norm,
               X_train_pos_norm, X_train_neg_norm, X_train_neu_norm,
               X_train_com_norm)).tocsr()

X_cr = hstack((X_cv_state_ohe,X_cv_teacher_ohe,X_cv_grade_ohe,X_cv_subcat_ohe,
               X_cv_cat_ohe,X_cv_price_norm,X_cv_teacher_number_of_previously_posted_projects_norm,
               X_cv_quan_norm,X_cv_now_title_norm, X_cv_now_text_norm,
               X_cv_pos_norm, X_cv_neg_norm, X_cv_neu_norm,
               X_cv_com_norm)).tocsr()

X_te = hstack((X_test_state_ohe,X_test_teacher_ohe,X_test_grade_ohe,X_test_subcat_ohe,
               X_test_cat_ohe,X_test_price_norm,X_test_teacher_number_of_previously_posted_projects_norm,
               X_test_quan_norm,X_test_now_title_norm, X_test_now_text_norm,
               X_test_pos_norm, X_test_neg_norm, X_test_neu_norm,
               X_test_com_norm)).tocsr()

print("Final Data matrix")
print(X_tr.shape, y_train.shape)
print(X_cr.shape, y_cv.shape)
print(X_te.shape, y_test.shape)
print("="*100)

In [None]:
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from tqdm import tqdm

train_auc=[]
cv_auc=[]
# candidate values for C, the inverse of the regularization strength
alpha= [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 10**5]
for i in tqdm(alpha):
    mb=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
    mb.fit(X_tr,y_train)
    
    y_train_pred=batch_predict(mb,X_tr)
    y_cv_pred=batch_predict(mb,X_cr)
    
    train_auc.append(roc_auc_score(y_train,y_train_pred))
    cv_auc.append(roc_auc_score(y_cv,y_cv_pred))
    
max_auc_ind_train=train_auc.index(max(train_auc))
alpha_max_auc_train=alpha[max_auc_ind_train]

print("max AUC on train data:",max(train_auc))
print("C value for maximum train AUC:",alpha_max_auc_train)
max_auc_ind_cv=cv_auc.index(max(cv_auc))
alpha_max_auc_cv=alpha[max_auc_ind_cv]
print("max AUC on CV data:",max(cv_auc))
print("C value for maximum CV AUC:",alpha_max_auc_cv)

plt.plot(alpha,train_auc,label='Train_AUC')
plt.plot(alpha,cv_auc,label='CV_AUC')
plt.scatter(alpha,cv_auc)
plt.scatter(alpha,train_auc)
#https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xscale.html
plt.xscale('log')
plt.title("AUC vs C (inverse regularization strength)")
plt.ylabel('AUC')
plt.xlabel('C: hyperparameter (log scale)')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc

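# use the C that gave the best CV AUC in the sweep above rather than a hard-coded value
nb = LogisticRegression(C=alpha_max_auc_cv,class_weight='balanced',penalty='l2')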
nb.fit(X_tr, y_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = nb.predict_proba(X_tr)[:,1]    
y_test_pred = nb.predict_proba(X_te)[:,1]

train_fpr, train_tpr, tr_thresholds = roc_curve(y_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(y_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr )))
plt.legend()
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.grid()
plt.show()

In [None]:
import seaborn as sns
#https://stackoverflow.com/a/33158941/10967428
con_tr=confusion_matrix(y_train,predict(y_train_pred,tr_thresholds,train_fpr,train_tpr))

sns.heatmap(con_tr,annot=True,fmt='d',annot_kws={'size':10})
plt.title("Train Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()

In [None]:
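import seaborn as sns
#https://stackoverflow.com/a/33158941/10967428
# thresholds, FPR and TPR must come from the same roc_curve call;
# the threshold is chosen on the train ROC and then applied to the test scores
con_te=confusion_matrix(y_test, predict(y_test_pred, tr_thresholds, train_fpr, train_tpr))

sns.heatmap(con_te,annot=True,fmt='d',annot_kws={'size':10})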
plt.title("Test Confusion Matrix")
plt.ylabel("Actual value")
plt.xlabel("Predicted value")
plt.show()

<h1> Conclusions </h1>

In [None]:
# Compare all models using the PrettyTable library
from prettytable import PrettyTable

x = PrettyTable()
x.field_names = ["Vectorizer", "Hyperparameter (C)", "AUC"]
x.add_row(["BOW", 1, 0.65])
x.add_row(["TFIDF", 0.1, 0.69])
x.add_row(["AVGW2V", 100000, 0.68])
x.add_row(["TFIDF W2V", 0.1, 0.68])
x.add_row(["NO TEXT", 100, 0.63])

print(x)

<h1>Summary</h1>

1. Comparing the featurizations, TFIDF performs slightly better than the others in terms of precision and recall, and it has the highest AUC in the table above (0.69).
2. Logistic regression compares very well with kNN in both accuracy and execution time.
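
Precision and recall, as mentioned in the summary, can be read directly off a confusion matrix. The sketch below assumes the layout returned by scikit-learn's `confusion_matrix` for binary labels, `[[TN, FP], [FN, TP]]`, and uses made-up counts rather than the notebook's results.

```python
def precision_recall(cm):
    """Compute precision and recall from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    # guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, for illustration only
cm = [[30, 20],
      [10, 40]]
precision, recall = precision_recall(cm)
print(round(precision, 3), round(recall, 3))
```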