# Problem Statement

Preprocessing data is one of the most important tasks of an ML engineer, and it takes about 50-60% of the time spent developing any project. Data is collected by scraping the web or by other means, so the dataset is often not in the form you would like it to be.
The dataset provided to you has the same problem. The CSV file open_code_2.csv contains real-world data that we have already structured well, but it still requires some preprocessing.
You have three tasks:
<ul>
<li>Featurisation of the column named <code>size</code>.</li>
<li>Featurisation of the column named <code>installs</code>.</li>
<li>Featurisation of the column named <code>price</code>.</li>
</ul>
As mentioned earlier, you are only allowed to use numpy and pandas for this task.
Do not edit the cells marked as UNIQUE; these cells are provided to help you complete this assignment.


# In Colab

This task does not require a GPU, so you can work on it on your own CPU. If you want to use this notebook on Google Colab, run the cell below and upload:
<ul>
    <li>open_code_2.csv</li>
    <li>check_task3.py</li>
</ul>

In [None]:
import os 
from google.colab import files

files.upload()

Saving check_task3.py to check_task3.py


{'check_task3.py': b'import numpy as np\nimport pandas as pd\n\ndef print_success():\n    print("\\x1b[32m\\"Success!!\\"\\x1b[0m")\n\ndef print_fail():\n    print("\\x1b[31m\\"Failure!!\\"\\x1b[0m")\n\ndef is_dataframe(df):\n    return isinstance(df,pd.DataFrame)\n\ndef is_float(df,col):\n    return df.dtypes.to_dict()[col]=="float64"\n\ndef check_float_or_df(df,col):\n    if is_dataframe(df):\n        print_success()\n    else:\n        print_fail()\n        print("The output should be a datframe")\n    if is_float(df,col):\n        print_success()\n    else:\n        print_fail()\n        print("The column element should be an integer")\n\ndef check_clean_size(df):\n    check_float_or_df(df,\'size\')\n\ndef check_clean_install(df):\n    check_float_or_df(df,\'install\')\n\ndef check_clean_price(df):\n    check_float_or_df(df,\'price\')'}

# Import Statements

In [None]:
# UNIQUE 1
# Importing Modules

import numpy as np
import pandas as pd
from check_task3 import *

# Understanding Data

In [None]:
# UNIQUE 2
# Running this cell will load the csv file and put it in the variable df

df = pd.read_csv('./open_code_2.csv')

In [None]:
# UNIQUE 3
# Running this cell will show you what the dataset looks like

df.head(10)

Unnamed: 0,category,reviews,size,installs,price,suitable_for,last_update,latest_ver,popularity
0,TOOLS,3988,11M,"1,000,000+",0,Everyone,"December 22, 2015",1.0.5,Low
1,FAMILY,12,13M,"1,000+",0,Everyone,"February 3, 2018",1.3.4,Low
2,FAMILY,407,306k,"50,000+",0,Mature 17+,"April 11, 2017",1.0,Low
3,MEDICAL,19,2.2M,"5,000+",0,Everyone,"May 12, 2018",1.4.15,Low
4,GAME,4416,59M,"500,000+",0,Teen,"January 18, 2017",2.1.7,Low
5,FAMILY,246,35M,"50,000+",0,Teen,"September 23, 2016",1.0,Low
6,SOCIAL,20675,96M,"1,000,000+",0,Teen,"August 2, 2018",4.4.0,Low
7,GAME,1976,21M,"100,000+",0,Teen,"March 3, 2018",1.6,Low
8,DATING,1093,5.8M,"100,000+",0,Mature 17+,"January 28, 2017",1.6.6,Low
9,GAME,2071,36M,"100,000+",0,Everyone,"June 29, 2018",1.0.3,Low


In [None]:
# We encourage you to do some EDA if you feel it is necessary, because there are no known disadvantages to
# understanding the dataset better :)
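
A quick look at the raw string columns can be sketched as follows. This is only an illustration: `sample_df` is a small hand-made frame standing in for `df` (its values mimic the formats shown above but are not rows copied from the file), and `paid_rows` is a name introduced here.

```python
import pandas as pd

# Hypothetical stand-in for df; only the three columns used in this assignment.
sample_df = pd.DataFrame({
    "size": ["11M", "13M", "306k"],
    "installs": ["1,000,000+", "1,000+", "50,000+"],
    "price": ["0", "0", "$1.99"],
})

# Column dtypes show which columns are still stored as text.
print(sample_df.dtypes)

# The distinct values reveal the formats that need cleaning.
print(sample_df["price"].unique())

# Count how many prices carry a leading dollar sign.
paid_rows = sample_df["price"].str.startswith("$").sum()
print(paid_rows)
```

Running the same calls on the real `df` should show how common each format is before you start writing the cleaning functions.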

# Task - 3

## Task 3A

This task requires you to convert the values in the column <code>size</code> to numerical values. First, check how the size is represented in the column. A file size ending with k is in KB, while a file size ending with M is in MB.

In [None]:
# Write your code to check that every file size is given in either KB or MB.

# Count how often each trailing character (the unit) occurs in the size column.
nun_dic = {}
for i in list(df['size']):
    last_char = i[-1]
    nun_dic.setdefault(last_char,0)
    nun_dic[last_char]+=1
print(nun_dic)

{'M': 2911, 'k': 106}
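
The same tally can also be sketched with pandas string methods instead of an explicit loop. The snippet assumes a small sample, `sample_sizes`, built from the first rows of `df.head(10)`; it stands in for `df['size']` and is not the full column.

```python
import pandas as pd

# Hypothetical sample standing in for df['size']; values copied from df.head(10).
sample_sizes = pd.Series(["11M", "13M", "306k", "2.2M", "59M"], name="size")

# .str[-1] extracts the last character of every entry;
# value_counts() then counts how often each unit occurs.
unit_counts = sample_sizes.str[-1].value_counts().to_dict()
print(unit_counts)
```

If any key other than `M` or `k` appears in the result, those rows need separate handling before conversion.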


In [None]:
# Write a function that takes in the column 'size' of the dataframe df and returns another dataframe
# named clean_size_df. In this dataframe, the file sizes that are in MB
# must be multiplied by 1024 to convert them to KB too. The column elements must be floats, which means
# the terminating unit (k/M) should be removed.
# The returned dataframe's column name should stay the same, here size.

def clean_size(column_df):
    '''
    argument:
        column_df = Dataframe of the column named size
    returns:
        clean_size_df = Dataframe where elements of column_df are changed as specified above
    '''
    size_array = []
    for i in list(column_df):
        last_char = i[-1]   # trailing unit, 'k' or 'M'
        rest_num = i[:-1]   # numeric part of the string
        if last_char=='M':
            # MB -> KB, using the factor 1024 required by the task
            size_array.append(float(rest_num)*1024)
        else:
            size_array.append(float(rest_num))
    # The checker looks for a column named 'size'
    clean_size_df = pd.DataFrame(size_array,columns=['size'])
    return clean_size_df

To let you know whether you are making any errors while completing this task, we have provided helper functions to check your work.
Success means you have completed this part of the task and should proceed to the next.
Failure means some part of your code is wrong.

In [None]:
df_l = clean_size(df['size'])

In [None]:
df_l

Unnamed: 0,size
0,11264.0
1,13312.0
2,306.0
3,2252.8
4,60416.0
...,...
3012,58368.0
3013,51200.0
3014,9113.6
3015,3584.0
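
For comparison, the conversion can also be written without a Python loop. The sketch below is one possible vectorized variant, not the required solution: `clean_size_vectorized` and `sample` are names introduced here, and the code assumes every entry ends in `k` or `M`, as the tally above suggests. It uses the factor 1024 stated in the task comment.

```python
import pandas as pd

def clean_size_vectorized(size_col: pd.Series) -> pd.DataFrame:
    """Convert strings such as '11M' or '306k' to sizes in KB (float)."""
    number = size_col.str[:-1].astype(float)    # numeric part, e.g. 11.0
    unit = size_col.str[-1]                     # trailing unit, 'M' or 'k'
    factor = unit.map({"M": 1024.0, "k": 1.0})  # MB -> KB uses 1024
    return (number * factor).to_frame(name="size")

# Hypothetical sample standing in for df['size'].
sample = pd.Series(["11M", "306k", "2.5M"], name="size")
result = clean_size_vectorized(sample)
print(result)
```

An entry with any other suffix would map to NaN here, which makes unexpected units easy to spot.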


In [None]:
# UNIQUE 4

check_clean_size(clean_size(df['size']))

"Success!!"
"Success!!"


## Task 3B

In this task you are supposed to featurise the column named <code>installs</code>.
You are supposed to:
<ul>
    <li>Remove the commas (,) from the string.</li>
    <li>Remove the plus sign (+) from the end of the string.</li>
    <li>Convert the string to float.</li>
</ul>

In [None]:
# Running this code block will show you what the installs column looks like
df['installs']

0         1,000,000+
1             1,000+
2            50,000+
3             5,000+
4           500,000+
            ...     
3012    100,000,000+
3013        100,000+
3014            500+
3015        100,000+
3016         10,000+
Name: installs, Length: 3017, dtype: object

In [None]:
# Write a function that takes in the column named installs and returns a dataframe after modifying
# installs as specified above.
# The returned dataframe's column should be named install.

def clean_install(column_df):
    '''
    parameters:
        column_df : Dataframe of the column installs
    returns:
        final_df : The final dataframe after modifying column_df as specified above
    '''
    fin = []
    for i in list(column_df):
        # Drop the trailing '+', then remove the commas before converting to float
        fin.append(float("".join((str(i).split("+")[0]).split(","))))
    final_df = pd.DataFrame(fin,columns=['install'])
    return final_df

In [None]:
lul = clean_install(df['installs'])
lul

Unnamed: 0,install
0,1000000.0
1,1000.0
2,50000.0
3,5000.0
4,500000.0
...,...
3012,100000000.0
3013,100000.0
3014,500.0
3015,100000.0
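
The three steps listed for this task map directly onto pandas string methods. The following is a minimal sketch under that assumption; `clean_install_vectorized` and `sample` are illustrative names, not part of the assignment files.

```python
import pandas as pd

def clean_install_vectorized(install_col: pd.Series) -> pd.DataFrame:
    """Turn strings such as '1,000,000+' into floats in a column named 'install'."""
    cleaned = (
        install_col.astype(str)
        .str.replace(",", "", regex=False)  # step 1: remove the commas
        .str.rstrip("+")                    # step 2: remove the trailing plus sign
        .astype(float)                      # step 3: convert to float
    )
    return cleaned.to_frame(name="install")

# Hypothetical sample standing in for df['installs'].
sample = pd.Series(["1,000,000+", "1,000+", "500+"], name="installs")
result = clean_install_vectorized(sample)
print(result)
```

Passing `regex=False` treats the comma as a literal character, so the replacement does not depend on regular-expression rules.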


In [None]:
# UNIQUE 5

check_clean_install(clean_install(df['installs']))

"Success!!"
"Success!!"


## Task 3C 

You are required to featurise the column named <code>price</code>.
You are supposed to:
<ul>
    <li>Remove the dollar sign if it exists.</li>
    <li>Convert the string to float.</li>
</ul>

In [None]:
# Run this cell to see what the column price looks like
df['price']


0           0
1           0
2           0
3           0
4           0
        ...  
3012        0
3013        0
3014    $1.99
3015        0
3016        0
Name: price, Length: 3017, dtype: object

In [None]:
# Write a function that takes in the column named price and returns a dataframe after modifying
# price as specified above.
# The returned dataframe's column name should stay the same, here price.

def clean_price(column_df):
    '''
    parameters:
        column_df : Dataframe of the column price
    returns:
        final_df : The final dataframe after modifying column_df as specified above
    '''
    fin_data = []
    for i in list(column_df):
        # Keep the part after the '$' (or the whole string if there is none)
        fin_data.append(float(str(i).split('$')[-1]))
    final_df = pd.DataFrame(fin_data,columns=['price'])
    return final_df
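
A loop-free variant of the same idea might look like the sketch below. It assumes the only non-numeric character is a leading dollar sign, as in the sample output above; `clean_price_vectorized` and `sample` are hypothetical names used for illustration.

```python
import pandas as pd

def clean_price_vectorized(price_col: pd.Series) -> pd.DataFrame:
    """Strip an optional leading '$' and convert the prices to float."""
    cleaned = (
        price_col.astype(str)  # covers columns that mix numbers and strings
        .str.lstrip("$")       # remove the dollar sign if it exists
        .astype(float)
    )
    return cleaned.to_frame(name="price")

# Hypothetical sample standing in for df['price'].
sample = pd.Series(["0", "$1.99", 0], name="price")
result = clean_price_vectorized(sample)
print(result)
```

Casting to `str` first keeps the function working whether pandas read a free app's price as the string `'0'` or as the integer `0`.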

In [None]:
#UNIQUE 6

check_clean_price(clean_price(df['price']))

"Success!!"
"Success!!"


# Submission

Once you are done with the assignment, open this notebook in Google Colab and share it with the people specified below. To avoid plagiarism, you are advised not to upload the notebook itself but to share only the link, with access restricted to the specified persons.
<table>
    <tr>
        <th>Pratyaksh Singh </th>
        <th>iib2020015@iiita.ac.in </th>
    </tr>
    <tr>
        <th>Himanshu Bhawnani </th>
        <th>iib2020035@iiita.ac.in </th>
    </tr>
    <tr>
        <th>Parth Soni </th>
        <th>iec2020132@iiita.ac.in </th>
    </tr>
    <tr>
        <th>Utkarsh Singh </th>
        <th>iec2020029@iiita.ac.in </th>
    </tr>
</table>

# Enjoy Open Code