# #EDA & Data Preprocessing on Google App Store Rating Dataset.

# Domain: Mobile device apps
Context:
The Play Store apps data has enormous potential to drive app-making businesses to success. However, many
apps are being developed every single day and only a few of them become profitable. It is important for
developers to be able to predict the success of their app and incorporate features which makes an app
successful. Before any such predictive-study can be done, it is necessary to do EDA and data-preprocessing on
the apps data available for google app store applications. From the collected apps data and user ratings from
the app stores, let's try to extract insightful information.


# Objective:
The Goal is to explore the data and pre-process it for future use in any predictive analytics study.
Data set Information:
Web scraped data of 10k Play Store apps for analyzing the Android market. Each app (row) has values for
category, rating, size, and more.


# Questions:-

# 1. Import required libraries and read the dataset.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')

In [2]:
data_1=pd.read_csv('C:\\Users\\guthireddy praveen\\Downloads\\PYTHON PROJECTS GL 2023\\Apps_data+(1).csv')


# 2. Check the first few samples, shape, info of the data and try to familiarize yourself with different features.

In [3]:
data_1.head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [4]:
data_1.shape

(10841, 13)

In [5]:
data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [6]:
data_1.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

# 3. Check summary statistics of the dataset. List out the columns that need to be worked upon for model building. 

In [7]:
data_1.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Rating,9367.0,4.193338,0.537431,1.0,4.0,4.3,4.5,19.0


In [8]:
data_1.describe(include='O').T

Unnamed: 0,count,unique,top,freq
App,10841,9660,ROBLOX,9
Category,10841,34,FAMILY,1972
Reviews,10841,6002,0,596
Size,10841,462,Varies with device,1695
Installs,10841,22,"1,000,000+",1579
Type,10840,3,Free,10039
Price,10841,93,0,10040
Content Rating,10840,6,Everyone,8714
Genres,10841,120,Tools,842
Last Updated,10841,1378,"August 3, 2018",326


# 4. Check if there are any duplicate records in the dataset? if any drop them.

In [9]:
data_1 = data_1.drop_duplicates()

In [10]:
data_1

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


# 5. Check the unique categories of the column 'Category', Is there any invalid category? If yes, drop them.


In [11]:
unique_categories = data_1['Category'].unique()

In [12]:
unique_categories

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION',
       '1.9'], dtype=object)

In [13]:
data_1 = data_1.drop(data_1[data_1['Category'] == 'invalid_category'].index)

In [14]:
data_1

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


# 6. Check if there are missing values present in the column Rating, If any? drop them and and create a new column as 'Rating_category' by converting ratings to high and low categories(>3.5 is high rest low)


In [15]:
missing_mask = data_1['Rating'].isnull()

In [16]:
missing_mask

0        False
1        False
2        False
3        False
4        False
         ...  
10836    False
10837    False
10838     True
10839    False
10840    False
Name: Rating, Length: 10358, dtype: bool

In [17]:
data_1.dropna(subset=['Rating'], inplace=True)

In [18]:
data_1

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10834,FR Calculator,FAMILY,4.0,7,2.6M,500+,Free,0,Everyone,Education,"June 18, 2017",1.0.0,4.1 and up
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [19]:
data_1['Rating_category'] = data_1['Rating'].apply(lambda x: 'high' if x > 3.5 else 'low')

In [20]:
data_1

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Rating_category
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,high
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,high
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,high
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,high
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,high
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10834,FR Calculator,FAMILY,4.0,7,2.6M,500+,Free,0,Everyone,Education,"June 18, 2017",1.0.0,4.1 and up,high
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up,high
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up,high
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device,high


# 7. Check the distribution of the newly created column 'Rating_category' and comment on the distribution.

In [21]:
rating_counts = data_1['Rating_category'].value_counts()
rating_counts

high    8013
low      880
Name: Rating_category, dtype: int64

# Distribution
It's also worth noting that the distribution of the 'Rating_category' column may be influenced by factors such as the size of the dataset and the underlying distribution of ratings. It's always a good idea to carefully examine the distribution of any new variables or features created from data to ensure they are behaving as expected.

# 8. Convert the column "Reviews'' to numeric data type and check the presence of outliers in the column and handle the outliers using a transformation approach.(Hint: Use log transformation)


In [22]:
data_1['Reviews'] = pd.to_numeric(data_1['Reviews'], errors='coerce')

In [23]:
data_1.dtypes

App                 object
Category            object
Rating             float64
Reviews            float64
Size                object
Installs            object
Type                object
Price               object
Content Rating      object
Genres              object
Last Updated        object
Current Ver         object
Android Ver         object
Rating_category     object
dtype: object

In [24]:
review_stats = data_1['Reviews'].describe()

In [25]:
review_stats

count    8.892000e+03
mean     4.727764e+05
std      2.905052e+06
min      1.000000e+00
25%      1.640000e+02
50%      4.714500e+03
75%      7.126675e+04
max      7.815831e+07
Name: Reviews, dtype: float64

In [26]:
data_1['Reviews_log'] = np.log(data_1['Reviews'])

In [27]:
data_1.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Rating_category,Reviews_log
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,high,5.068904
1,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,high,6.874198
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510.0,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,high,11.379508
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,high,12.281384
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,high,6.874198


# 9. The column 'Size' contains alphanumeric values, treat the non numeric data and convert the column into suitable data type. (hint: Replace M with 1 million and K with 1 thousand, and drop the entries where size='Varies with device')

In [28]:
replace_dict = {'M': '1000000','K':'1000'}
data_1['Size'] = data_1['Size'].replace(replace_dict, regex=True)
data_1.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Rating_category,Reviews_log
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,191000000.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,high,5.068904
1,Coloring book moana,ART_AND_DESIGN,3.9,967.0,141000000.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,high,6.874198
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510.0,8.71,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,high,11.379508
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,251000000.0,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,high,12.281384
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,2.81,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,high,6.874198


In [29]:
data_1['Size'] = pd.to_numeric(data_1['Size'], errors='coerce')

In [30]:
data_1 = data_1.drop(data_1[data_1['Size'] == 'Varies with device'].index)

In [31]:
data_1['Size'].dtypes

dtype('float64')

In [32]:
data_1

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Rating_category,Reviews_log
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,1.910000e+08,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,high,5.068904
1,Coloring book moana,ART_AND_DESIGN,3.9,967.0,1.410000e+08,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,high,6.874198
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510.0,8.710000e+00,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,high,11.379508
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,2.510000e+08,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,high,12.281384
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,2.810000e+00,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,high,6.874198
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10834,FR Calculator,FAMILY,4.0,7.0,2.610000e+00,500+,Free,0,Everyone,Education,"June 18, 2017",1.0.0,4.1 and up,high,1.945910
10836,Sya9a Maroc - FR,FAMILY,4.5,38.0,5.310000e+08,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up,high,3.637586
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4.0,3.610000e+00,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up,high,1.386294
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114.0,,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device,high,4.736198


# 10. Check the column 'Installs', treat the unwanted characters and convert the column into a suitable data type.


In [33]:
data_1['Installs'].head()

0        10,000+
1       500,000+
2     5,000,000+
3    50,000,000+
4       100,000+
Name: Installs, dtype: object

In [34]:
data_1['Installs'] = data_1['Installs'].str.replace(',', '')
data_1['Installs'] = data_1['Installs'].str.replace('+', '')
data_1['Installs'] = data_1['Installs'].str.replace('Free', '0')

In [35]:
data_1['Installs'].head()

0       10000
1      500000
2     5000000
3    50000000
4      100000
Name: Installs, dtype: object

In [36]:
data_1['Installs'].info()

<class 'pandas.core.series.Series'>
Int64Index: 8893 entries, 0 to 10840
Series name: Installs
Non-Null Count  Dtype 
--------------  ----- 
8893 non-null   object
dtypes: object(1)
memory usage: 139.0+ KB


In [37]:
data_1['Installs'] = data_1['Installs'].astype(int)

In [38]:
data_1['Installs'].info()

<class 'pandas.core.series.Series'>
Int64Index: 8893 entries, 0 to 10840
Series name: Installs
Non-Null Count  Dtype
--------------  -----
8893 non-null   int32
dtypes: int32(1)
memory usage: 104.2 KB


# 11. Check the column 'Price' , remove the unwanted characters and convert the column into a suitable data type.

In [39]:
data_1['Price'].head()

0    0
1    0
2    0
3    0
4    0
Name: Price, dtype: object

In [40]:
data_1['Price'].info()

<class 'pandas.core.series.Series'>
Int64Index: 8893 entries, 0 to 10840
Series name: Price
Non-Null Count  Dtype 
--------------  ----- 
8893 non-null   object
dtypes: object(1)
memory usage: 139.0+ KB


In [41]:
data_1['Price'] = data_1['Price'].str.replace('$', '')
data_1['Price'] = data_1['Price'].str.replace('Everyone', '')
data_1['Price'] = data_1['Price'].str.replace('', '0')

In [42]:
data_1['Price'] = data_1['Price'].astype(float)

In [43]:
data_1['Price'].info()

<class 'pandas.core.series.Series'>
Int64Index: 8893 entries, 0 to 10840
Series name: Price
Non-Null Count  Dtype  
--------------  -----  
8893 non-null   float64
dtypes: float64(1)
memory usage: 139.0 KB


# 12. Drop the columns which you think redundant for the analysis.(suggestion: drop column 'rating', since we created a new feature from it (i.e. rating_category) and the columns 'App', 'Rating' ,'Genres','Last Updated', 'Current Ver','Android Ver' columns since which are redundant for our analysis)

In [44]:
data_1.drop(['App', 'Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], axis=1, inplace=True)

In [45]:
data_1

Unnamed: 0,Category,Reviews,Size,Installs,Type,Price,Content Rating,Rating_category,Reviews_log
0,ART_AND_DESIGN,159.0,1.910000e+08,10000,Free,0.0,Everyone,high,5.068904
1,ART_AND_DESIGN,967.0,1.410000e+08,500000,Free,0.0,Everyone,high,6.874198
2,ART_AND_DESIGN,87510.0,8.710000e+00,5000000,Free,0.0,Everyone,high,11.379508
3,ART_AND_DESIGN,215644.0,2.510000e+08,50000000,Free,0.0,Teen,high,12.281384
4,ART_AND_DESIGN,967.0,2.810000e+00,100000,Free,0.0,Everyone,high,6.874198
...,...,...,...,...,...,...,...,...,...
10834,FAMILY,7.0,2.610000e+00,500,Free,0.0,Everyone,high,1.945910
10836,FAMILY,38.0,5.310000e+08,5000,Free,0.0,Everyone,high,3.637586
10837,FAMILY,4.0,3.610000e+00,100,Free,0.0,Everyone,high,1.386294
10839,BOOKS_AND_REFERENCE,114.0,,1000,Free,0.0,Mature 17+,high,4.736198


# 13. Encode the categorical columns.
 

In [46]:
from sklearn.preprocessing import LabelEncoder

In [47]:
le = LabelEncoder()

# Encode categorical columns
data_1['Category'] = le.fit_transform(data_1['Category'])
data_1['Type'] = le.fit_transform(data_1['Type'])
data_1['Content Rating'] = le.fit_transform(data_1['Content Rating'])

In [69]:
data_1.head()

Unnamed: 0,Category,Reviews,Size,Installs,Type,Price,Content Rating,Rating_category,Reviews_log
0,1,159.0,191000000.0,10000,1,0.0,1,high,5.068904
1,1,967.0,141000000.0,500000,1,0.0,1,high,6.874198
2,1,87510.0,8.71,5000000,1,0.0,1,high,11.379508
3,1,215644.0,251000000.0,50000000,1,0.0,4,high,12.281384
4,1,967.0,2.81,100000,1,0.0,1,high,6.874198


# 14. Segregate the target and independent features (Hint: Use Rating_category as the target)


In [48]:
X = data_1.drop(['Rating_category'], axis=1)
y = data_1['Rating_category']
data_1.head()

Unnamed: 0,Category,Reviews,Size,Installs,Type,Price,Content Rating,Rating_category,Reviews_log
0,1,159.0,191000000.0,10000,1,0.0,1,high,5.068904
1,1,967.0,141000000.0,500000,1,0.0,1,high,6.874198
2,1,87510.0,8.71,5000000,1,0.0,1,high,11.379508
3,1,215644.0,251000000.0,50000000,1,0.0,4,high,12.281384
4,1,967.0,2.81,100000,1,0.0,1,high,6.874198


# 15.Split the dataset into train and test

In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('x_train dataset',X_train)
print('x_test dataset',X_test)
print('y_train dataset',y_train)
print('y_test dataset',y_test)

x_train dataset       Category   Reviews          Size  Installs  Type    Price  \
8995        12   18298.0  8.010000e+00    500000     1   0.0000   
2368        21      54.0  1.710000e+08      5000     1   0.0000   
4473        24     147.0  1.610000e+08     10000     1   0.0000   
7102         7   17998.0  7.910000e+00   1000000     1   0.0000   
2268        21     104.0  3.810000e+08      1000     2  20.0909   
...        ...       ...           ...       ...   ...      ...   
6760         4      47.0  1.410000e+00      1000     2  30.0008   
6038         5      63.0  1.310000e+08      5000     1   0.0000   
6295        12  205914.0  5.410000e+08  10000000     1   0.0000   
1069        13   21996.0  1.410000e+08   1000000     1   0.0000   
8751        14   10225.0  3.610000e+08   1000000     1   0.0000   

      Content Rating  Reviews_log  
8995               1     9.814547  
2368               3     3.988984  
4473               1     4.990433  
7102               4     9.798016  

# 16. Standardize the data, so that the values are within a particular range

In [50]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [57]:
X_train_scaled

array([[-0.71068591, -0.16051489, -0.86761262, ..., -0.04316404,
        -0.46385036,  0.4111542 ],
       [ 0.37596395, -0.16738267, -0.18031368, ..., -0.04316404,
         1.52101065, -1.09324489],
       [ 0.73818057, -0.16734766, -0.2205066 , ..., -0.04316404,
        -0.46385036, -0.83462984],
       ...,
       [-0.71068591, -0.08988865,  1.30682446, ..., -0.04316404,
        -0.46385036,  1.03626957],
       [-0.58994704, -0.15912281, -0.30089245, ..., -0.04316404,
        -0.46385036,  0.45868832],
       [-0.46920817, -0.16355389,  0.58335185, ..., -0.04316404,
        -0.46385036,  0.26086929]])

In [51]:
X_test_scaled

array([[-0.34846929, -0.09153633,  1.86952538, ..., -0.04316404,
         2.51344116,  1.0307211 ],
       [-0.71068591, -0.13977077, -0.86761263, ..., -0.04316404,
        -0.46385036,  0.76989912],
       [ 0.49670282, -0.1673917 , -0.86761264, ..., -0.04316404,
        -0.46385036, -1.24503549],
       ...,
       [-0.58994704, -0.16734314, -0.30089245, ..., -0.04316404,
        -0.46385036, -0.81436525],
       [-0.71068591, -0.16489252,  0.38238724, ..., -0.04316404,
        -0.46385036,  0.15050594],
       [-0.71068591, -0.16739923,  2.07048999, ..., -0.04014027,
        -0.46385036, -1.5287422 ]])