# Problem Statement

Preprocessing data is one of the most important tasks of an ML engineer, and it takes about 50-60% of the time when developing any project. Data is collected by scraping the web or by other means, and the dataset is often not in the form you would like it to be.
The dataset provided to you has the same problem. The CSV file open_code.csv contains real-world data that we have already structured well, but it still requires some preprocessing.
You have three tasks:
<ul>
<li>Check for NaN values in the dataset and drop them.</li>
<li>Check whether the dataset is imbalanced. If it is, balance it using upsampling.</li>
<li>Drop the column named <code>app_id</code></li> 
</ul>
You are only allowed to use numpy and pandas for this task.
Do not edit the cells that are marked as UNIQUE, as these cells are provided to help you complete this assignment.


# In Colab

This task does not require a GPU, so you can work on it on your own machine. If you want to use this notebook on Google Colab, run the cell below and upload:
<ul>
    <li>open_code.csv</li>
    <li>check_task1.py</li>
</ul>

In [None]:
import os 
from google.colab import files

files.upload()

Saving check_task1.py to check_task1.py
Saving open_code.csv to open_code.csv


{'check_task1.py': b'import numpy as np\nimport pandas as pd\n\ndef print_success():\n    print("\\x1b[32m\\"Success!!\\"\\x1b[0m")\n\ndef print_fail():\n    print("\\x1b[31m\\"Failure!!\\"\\x1b[0m")\n\ndef check_cleaned_data(df):\n    expected_shape = (1971,10)\n    any_null = any(df.isnull().any())\n    if df.shape==expected_shape and not any_null:\n        print_success()\n    else:\n        print_fail()\n\ndef is_indexed_properly(final_df):\n    ind_arr = np.array(final_df.index)\n    return len(np.unique(ind_arr))==len(ind_arr)\n\ndef is_balanced(final_df):\n    final_df_dic = final_df[\'popularity\'].value_counts().to_dict()\n    return abs(final_df_dic[\'High\']-final_df_dic[\'Low\'])<150\n\ndef is_downsampled(final_df):\n    final_df_dic = final_df[\'popularity\'].value_counts().to_dict()\n    return final_df_dic[\'High\']==1448\n\ndef check_balanced_data(final_df):\n    if is_indexed_properly(final_df):\n        print_success()\n    else:\n        print_fail()\n        print("

# Import Statements

In [None]:
# UNIQUE 1
# Importing Modules

import numpy as np
import pandas as pd
from check_task1 import *

# Understanding Data

In [None]:
# UNIQUE 2
# Running this cell will load the csv file and store it in the variable df

df = pd.read_csv('./open_code.csv')

In [None]:
# UNIQUE 3
# Running this cell will show you what the dataset looks like

df.head(10)

Unnamed: 0,app_id,category,reviews,size,installs,price,suitable_for,last_update,latest_ver,popularity
0,330090,PERSONALIZATION,4,511k,50+,0,Everyone,"December 31, 2016",1.4,High
1,226147,GAME,568391,5.2M,"5,000,000+",0,Teen,"July 1, 2014",4.3.1,High
2,107000,FAMILY,144,70M,"1,000+",$2.99,Teen,"January 26, 2018",1.0.0,High
3,217582,FAMILY,1499466,96M,"10,000,000+",0,Teen,"July 24, 2018",1.25.0,High
4,370113,DATING,84,4.5M,"1,000+",0,Mature 17+,"July 6, 2018",8.2,High
5,628931,PARENTING,247,28M,"100,000+",0,Everyone,"March 19, 2018",1.3.0,High
6,72280,PHOTOGRAPHY,180697,6.1M,"10,000,000+",0,Everyone,"April 25, 2017",2.2.5,High
7,793815,TOOLS,3988,11M,"1,000,000+",0,Everyone,"December 22, 2015",1.0.5,Low
8,660969,FAMILY,12,13M,"1,000+",0,Everyone,"February 3, 2018",1.3.4,Low
9,732069,MEDICAL,6,26M,"1,000+",0,Everyone,"May 25, 2018",1.0.32,High


In [None]:
# We encourage you to do some EDA if you feel it is necessary, because there is no known
# disadvantage to understanding the dataset better :)

# Task - 1

## Task 1A

Sometimes, while collecting data, a few of the data points are left out because of errors such as problems in data transfer or the data simply not being available.
Having NaN values in the data creates many problems during training, and there are numerous ways of dealing with them. The easiest one, and the one you will use in this task, is dropping the rows that contain a NaN value.
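
As a rough illustration of the idea above, here is a minimal sketch on a made-up toy dataframe (the values and the name `toy` are assumptions for illustration, not part of open_code.csv). It counts the missing values per column and then drops every row that contains at least one NaN.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "app_id": [1, 2, 3, 4],
    "latest_ver": ["1.0", np.nan, "2.1", np.nan],
    "popularity": ["High", "Low", "High", "Low"],
})

# Count the missing values in each column.
nan_counts = toy.isna().sum()
print(nan_counts)

# Keep only the rows that have no missing value in any column.
toy_clean = toy.dropna()
print(toy_clean.shape)
```

Note that `dropna()` returns a new dataframe and keeps the original row labels, so the index of the result may have gaps.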

In [None]:
# Write your code to check if the dataset contains NaN values
df[df['latest_ver'].isna()]

Unnamed: 0,app_id,category,reviews,size,installs,price,suitable_for,last_update,latest_ver,popularity
123,101207,MEDICAL,6,16M,500+,0,Everyone,"August 4, 2018",,High
337,425998,ART_AND_DESIGN,55,2.7M,"5,000+",0,Everyone,"June 6, 2018",,Low
1536,686458,HEALTH_AND_FITNESS,14394,9.9M,"500,000+",0,Everyone,"July 16, 2017",,High
1826,431459,SOCIAL,44,6.3M,"1,000+",0,Teen,"May 21, 2018",,Low


In [None]:
# Now drop the NaN value if your dataset contains it.
# Store the final dataframe in df itself
# Replace none with your own code
# You can use more than one lines of code if you want to. :)
df = df.dropna()

To let you know whether you are making any errors while completing this task, we have provided helper functions to check your work.
Success means you have completed this part of the task and should proceed to the next one.
Failure means some part of your code is incorrect.

In [None]:
# UNIQUE 4

check_cleaned_data(df)

"Success!!"


## Task 1B

Imbalanced data also causes problems, because a model trained on it tends to be biased toward the majority class. It is therefore advisable to balance the data before training.

In [None]:
# Check if the dataset is balanced based on the rating of the app (High/Low)
# Write your code below
df['popularity'].value_counts()

High    1448
Low      523
Name: popularity, dtype: int64

Now you know whether the dataset is balanced. There are many ways to balance an imbalanced dataset; here we will use one of them, called upsampling. You repeat the minority class n times (where n is a natural number) until the dataset is roughly balanced.

To let you have more fun in this assignment you are allowed to use numpy and pandas only ;)
<details>
    <summary>
        <font size="3" color="green">
            <b> <i> Optional hints  </i> </b>
        </font>
    </summary>
    You can check out the documentation of <code>pd.DataFrame.groupby</code> <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html">here</a><br>
    You can check out the documentation of <code>pd.concat</code> <a href="https://pandas.pydata.org/docs/reference/api/pandas.concat.html">here</a>
</details>
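
The upsampling idea described above can be sketched in a general form. This is only one possible approach under stated assumptions: the helper name `upsample_minority` and the toy dataframe are hypothetical, and the number of copies is chosen by rounding the ratio of the majority count to each class count, so the result is approximately rather than exactly balanced.

```python
import numpy as np
import pandas as pd

def upsample_minority(frame, label_col):
    """Repeat each class a whole number of times so class counts end up close."""
    counts = frame[label_col].value_counts()
    majority_count = counts.max()
    parts = []
    for label, count in counts.items():
        group = frame[frame[label_col] == label]
        # Nearest whole number of copies, with at least one copy per class.
        repeats = max(1, int(round(majority_count / count)))
        parts.extend([group] * repeats)
    # ignore_index=True rebuilds a unique 0..n-1 index for the combined rows.
    return pd.concat(parts, ignore_index=True)

# Toy data (made up): 6 'High' rows and 2 'Low' rows.
toy = pd.DataFrame({
    "feature": np.arange(8),
    "popularity": ["High"] * 6 + ["Low"] * 2,
})
balanced = upsample_minority(toy, "popularity")
print(balanced["popularity"].value_counts())
```

With 1448 and 523 rows, the same rounding rule would give 3 copies of the minority class, since 1448 / 523 is roughly 2.77.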

In [None]:
# You are supposed to write a function that takes in the imbalanced data and then balances it.
# Your final data should be stored in the dataframe final_df, and its index should be unique (no repeated labels).
# Complete the function below

def upsample_dataset(df):
    '''
    parameters:
        df : An unbalanced dataset
    returns:
        final_df : A balanced dataset using upsampling
    '''
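    # Split the rows by class label.
    g = df.groupby('popularity')
    low, high = g.get_group('Low'), g.get_group('High')
    # 'Low' (523 rows) repeated 3 times gives 1569 rows, close to 'High' (1448 rows).
    # ignore_index=True rebuilds a unique 0..n-1 index.
    final_df = pd.concat([low, low, low, high], ignore_index=True)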
    return final_df 

In [None]:
# UNIQUE 5

check_balanced_data(upsample_dataset(df))

"Success!!"
"Success!!"
"Success!!"


## Task 1C 

You are required to drop the column named <code>app_id</code>, as it carries no useful information for training.
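
For reference, a minimal sketch of dropping a column with pandas, using a made-up two-row dataframe (the name `toy` and its values are assumptions for illustration). `DataFrame.drop` returns a new dataframe and leaves the original unchanged unless you reassign it.

```python
import pandas as pd

# Toy frame (made-up values) with an identifier column to remove.
toy = pd.DataFrame({"app_id": [101, 102], "category": ["GAME", "TOOLS"]})

# columns=[...] is equivalent to passing the labels with axis=1.
without_id = toy.drop(columns=["app_id"])
print(list(without_id.columns))
```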

In [None]:
# Storing the balanced df

final_df = upsample_dataset(df)

In [None]:
# Write code to check the columns present in your dataset
final_df.columns

Index(['app_id', 'category', 'reviews', 'size', 'installs', 'price',
       'suitable_for', 'last_update', 'latest_ver', 'popularity'],
      dtype='object')

In [None]:
# Write code to drop the column app_id

# Drop app_id from the balanced dataframe
dropped_app_id_df = final_df.drop(columns=['app_id'])

In [None]:
#UNIQUE 6

check_column(dropped_app_id_df)

"Success!!"
"Success!!"


# Submission

Once you are done with the assignment, open this notebook in Google Colab and share it with the people listed below. To avoid plagiarism, you are advised not to upload the notebook itself but to share only the link, with access restricted to the listed people.
<table>
    <tr>
        <th>Pratyaksh Singh </th>
        <th>iib2020015@iiita.ac.in </th>
    </tr>
    <tr>
        <th>Himanshu Bhawnani </th>
        <th>iib2020035@iiita.ac.in </th>
    </tr>
    <tr>
        <th>Parth Soni </th>
        <th>iec2020132@iiita.ac.in </th>
    </tr>
    <tr>
        <th>Utkarsh Singh </th>
        <th>iec2020029@iiita.ac.in </th>
    </tr>
</table>

# Enjoy Open Code