# Problem Statement

Machine learning models cannot work with raw text directly; they operate on numerical features (categorical codes, real values, and so on). Non-numerical data therefore has to be converted into numerical vectors before the tools of linear algebra can be applied to it.

Some columns of the dataset provided to you have this problem. The CSV file open_code_2.csv contains real-world data with several categorical features.
You have two tasks:
<ul>
<li>Check the value count of each category in the <code>category</code> column.</li>
<li>One-hot encode the <code>category</code> column, with all categories whose count is less than 25 grouped into a separate category named <code>other_category</code>.</li>
</ul>

As mentioned earlier, you are only allowed to use numpy and pandas for this task.
Do not edit the cells marked UNIQUE; these cells are provided to help you complete the assignment.


# In Colab

This task does not require a GPU, so you can work on it on your own CPU. If you want to use this notebook on Google Colab, run the cell below and upload:
<ul>
    <li>open_code_2.csv</li>
    <li>check_task2.py</li>
</ul>

In [1]:
# import os 
# from google.colab import files

# files.upload()

# Import Statements

In [2]:
# UNIQUE 1
# Importing Modules

import numpy as np
import pandas as pd
from check_task2 import *

In [3]:
# UNIQUE 2
# Running this cell will load the csv file and put it in the variable df

df = pd.read_csv('./open_code_2.csv')

In [4]:
# Running this cell will display the dataframe
# UNIQUE 3
df

Unnamed: 0,category,reviews,size,installs,price,suitable_for,last_update,latest_ver,popularity
0,TOOLS,3988,11M,"1,000,000+",0,Everyone,"December 22, 2015",1.0.5,Low
1,FAMILY,12,13M,"1,000+",0,Everyone,"February 3, 2018",1.3.4,Low
2,FAMILY,407,306k,"50,000+",0,Mature 17+,"April 11, 2017",1.0,Low
3,MEDICAL,19,2.2M,"5,000+",0,Everyone,"May 12, 2018",1.4.15,Low
4,GAME,4416,59M,"500,000+",0,Teen,"January 18, 2017",2.1.7,Low
...,...,...,...,...,...,...,...,...,...
3012,GAME,3883589,57M,"100,000,000+",0,Everyone,"July 26, 2018",2.21.1,High
3013,FAMILY,5898,50M,"100,000+",0,Everyone,"August 1, 2017",3.3.8.03082017,High
3014,FAMILY,16,8.9M,500+,$1.99,Everyone,"May 9, 2017",1.0,High
3015,HEALTH_AND_FITNESS,9612,3.5M,"100,000+",0,Everyone,"May 18, 2018",1.8.12,High


# Task-2

## Task-2A


Categorical data cannot be passed directly into a model, because machine learning models work only with numbers.

So, we use a method called <code>One Hot Encoding</code>.

With one-hot encoding, each distinct category becomes a new column that holds a binary value of 1 or 0. Each row is thus represented as a binary vector in which every entry is zero except the one at the index of that row's category, which is 1.
You can check <a href="https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f">this</a> link to learn more about one-hot encoding.
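
As a minimal sketch of the idea, the snippet below one-hot encodes a small toy column. The values in `toy` and the names `index_of`, `codes`, and `toy_one_hot` are illustrative assumptions, not part of the assignment; it relies on the fact that row `i` of an identity matrix is the one-hot vector for index `i`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values are illustrative, not taken from the dataset
toy = pd.Series(["GAME", "FAMILY", "GAME", "TOOLS"])

# Assign each distinct category a column index, in order of first appearance
categories = list(toy.unique())
index_of = {cat: i for i, cat in enumerate(categories)}

# Row i of the identity matrix is the one-hot vector for category index i
codes = toy.map(index_of).to_numpy()
toy_one_hot = pd.DataFrame(np.eye(len(categories))[codes], columns=categories)
print(toy_one_hot)
```

Each row of `toy_one_hot` contains a single 1.0, in the column named after that row's category.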

The <code>category</code> column has many categories, and training a model on a category that appears fewer than 25 times does not make much sense, so we will combine all such categories into a single separate category called <code>other_category</code>.
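
The counting step can be sketched on toy data as follows. The series `toy` and the lowered `threshold` of 3 are assumptions chosen to keep the example small; the assignment itself applies the same idea to `df["category"]` with a threshold of 25.

```python
import pandas as pd

# Hypothetical toy column and threshold; the assignment uses df["category"] and 25
toy = pd.Series(["GAME"] * 5 + ["FAMILY"] * 3 + ["COMICS"] * 1 + ["DATING"] * 2)
threshold = 3

# value_counts maps each category to the number of rows it appears in
counts = toy.value_counts()

# Boolean indexing keeps only the categories below the threshold
rare = set(counts[counts < threshold].index)
print(rare)
```

Here `COMICS` and `DATING` fall below the toy threshold, so they are the ones that would be merged.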

In [5]:
# Write here the code to check the value count of each category of the column named "category"
category_counts = df["category"].value_counts()
# Names of the categories that appear fewer than 25 times
rare_categories = category_counts[category_counts < 25].index
# set of category whose count is less than 25
category_less_than_25 = set(rare_categories)

To let you know whether you are making any errors while completing this task, we have provided helper functions that check your work.
Success means you have completed this part of the task and should proceed to the next.
Failure means some part of your code is incorrect.

In [6]:
# UNIQUE 4
check_small_category(category_less_than_25)

"Success!!"
"Success!!"


Group the categories with a count of less than 25 into a single category named "other_category".
Basically, you have to iterate through the column and change the name of every category whose count is
less than 25 to other_category.
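
The replacement can also be done without an explicit loop. The sketch below is one possible vectorized variant using `Series.isin` and `Series.where`, shown on an assumed toy column with an assumed set `rare`; it is not necessarily how the helper functions expect the task to be solved.

```python
import pandas as pd

# Hypothetical toy data; in the assignment the set comes from the value counts
toy = pd.Series(["GAME", "COMICS", "FAMILY", "DATING", "GAME"], name="category")
rare = {"COMICS", "DATING"}

# Keep a value where it is NOT rare; otherwise substitute 'other_category'
grouped = toy.where(~toy.isin(rare), "other_category").to_frame()
print(grouped["category"].tolist())
```

`to_frame()` reuses the series name, so the resulting DataFrame has a single column called `category`.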

In [7]:
# Complete the function below
# It takes in the column category and then returns a df which has been modified as mentioned above
# Note that the returned dataframe column name should be category
def change_column_category(column_df, category_less_than_25=category_less_than_25):
    '''
    argument:
        column_df : the Series df['category']
        category_less_than_25 : A set of all categories whose count is less than 25
    returns:
        changed_column_df : A dataframe with a single column named category, in which
                            every rare category has been replaced by 'other_category'
    '''
    # Replace each rare category with 'other_category' and keep the rest unchanged
    changed_column_list = [
        'other_category' if category in category_less_than_25 else category
        for category in column_df
    ]
    changed_column_df = pd.DataFrame(data=changed_column_list, columns=['category'])
    return changed_column_df

In [8]:
check_change_column_category(change_column_category(df['category']))

"Success!!"
"Success!!"


In [9]:
# UNIQUE 6
# Running this cell will drop the column category from the original dataframe and put the modified one in its
# place
changed_col_df = change_column_category(df['category'])
df = df.drop(['category'],axis=1)
df = pd.concat([changed_col_df,df],axis=1)

## Task-2B


Apply one-hot encoding to the column "category".
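
For comparison, pandas ships a built-in encoder, `pd.get_dummies`. The sketch below runs it on an assumed toy column; it is offered only as a way to cross-check a hand-written encoder, since the task asks you to write the function yourself. Note that `get_dummies` orders its columns by sorted category name, which may differ from the order of first appearance.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative
toy = pd.Series(["GAME", "FAMILY", "other_category", "GAME"])

# dtype=float yields 1.0 / 0.0 entries instead of booleans
dummies = pd.get_dummies(toy, dtype=float)
print(dummies.columns.tolist())
print(dummies)
```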

In [10]:
# You are supposed to write a function that takes a column as input and applies one hot encoding to it.
# Make sure the column names of your returned dataframe are the same as the category names
# That means the names of your dataframe columns should be GAME FAMILY DATING etc
def column_to_one_hot(column_df):
    '''
    argument:
        column_df : the Series df['category']
    returns:
        one_hot_df : A dataframe where the column category has been converted to one_hot
    '''
    # Map each distinct category to a column index, in order of first appearance
    cat_to_index_dict = {cat: i for i, cat in enumerate(column_df.unique())}
    one_hot_rows = []
    for cat in column_df:
        # Start from an all-zero vector and set the entry for this row's category to 1
        one_hot = np.zeros(len(cat_to_index_dict))
        one_hot[cat_to_index_dict[cat]] = 1
        one_hot_rows.append(one_hot)
    one_hot_df = pd.DataFrame(one_hot_rows, columns=list(cat_to_index_dict.keys()))
    return one_hot_df


In [11]:
check_column_to_one_hot(column_to_one_hot(df['category']),df)

"Success!!"
"Success!!"
"Success!!"


In [12]:
column_to_one_hot(df['category'])

Unnamed: 0,TOOLS,FAMILY,MEDICAL,GAME,SOCIAL,DATING,LIFESTYLE,SPORTS,NEWS_AND_MAGAZINES,PRODUCTIVITY,...,COMMUNICATION,BUSINESS,SHOPPING,AUTO_AND_VEHICLES,PERSONALIZATION,MAPS_AND_NAVIGATION,other_category,COMICS,FOOD_AND_DRINK,EDUCATION
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3012,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3013,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3014,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
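
A one-hot table such as the one above can be sanity-checked by hand: every entry should be 0 or 1, every row should sum to 1, and the column holding the 1 should be named after the row's category. The helper `is_valid_one_hot` below is a hypothetical sketch of such a check on toy data; it is not the logic used by the provided `check_column_to_one_hot` function, whose internals are not shown here.

```python
import numpy as np
import pandas as pd

def is_valid_one_hot(one_hot_df, original_column):
    """Hypothetical helper: True if one_hot_df is a one-hot encoding of original_column."""
    values = one_hot_df.to_numpy()
    only_zero_one = np.isin(values, [0, 1]).all()
    one_per_row = (values.sum(axis=1) == 1).all()
    # The column holding the 1 must be named after the row's category
    recovered = one_hot_df.columns[values.argmax(axis=1)]
    matches = (np.asarray(recovered) == original_column.to_numpy()).all()
    return bool(only_zero_one and one_per_row and matches)

# Hypothetical toy column and a hand-written encoding of it
toy = pd.Series(["GAME", "FAMILY", "GAME"])
toy_one_hot = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], columns=["GAME", "FAMILY"]
)
print(is_valid_one_hot(toy_one_hot, toy))
```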


# Submission

Once you are done with the assignment, open this notebook in Google Colab and share it with the people specified below. To avoid plagiarism, you are advised not to upload the notebook itself but to share only the link, with access restricted to the specified persons.
<table>
    <tr>
        <th>Pratyaksh Singh </th>
        <th>iib2020015@iiita.ac.in </th>
    </tr>
    <tr>
        <th>Himanshu Bhawnani </th>
        <th>iib2020035@iiita.ac.in </th>
    </tr>
    <tr>
        <th>Parth Soni </th>
        <th>iec2020132@iiita.ac.in </th>
    </tr>
    <tr>
        <th>Utkarsh Singh </th>
        <th>iec2020029@iiita.ac.in </th>
    </tr>
</table>