
<pre>
<center><h1><b>Machine Learning</b></h1></center>

<center><h1><b>Lab - 3</b></h1></center>
</pre>

# 📱 Lab: Scikit-Learn Fundamentals (Google Play Store)

**Objective:** Transition from manual data cleaning to automated Machine Learning preprocessing using Scikit-Learn.

**Prerequisites:**
* Ensure you have the `googleplaystore_cleaned.csv` file (from the previous lab) in this folder.

### 1. Load Preprocessed Data
**Instruction:** Load the dataset you cleaned in the previous lab. This dataset should already have `Installs`, `Price`, and `Reviews` converted to numbers.

In [1]:
import pandas as pd
import numpy as np 


In [2]:
df = pd.read_csv('googleplaystore_cleaned.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,10000.0,10000.0,Free,0.0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,1,Coloring book moana,ART_AND_DESIGN,3.9,967.0,500000.0,500000.0,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510.0,5000000.0,5000000.0,Free,0.0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,50000000.0,50000000.0,Free,0.0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,100000.0,100000.0,Free,0.0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
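
Before moving on, it can help to confirm that the columns the instruction mentions really are numeric. The sketch below is only an illustration: `sample` is a hypothetical two-row stand-in for `googleplaystore_cleaned.csv`, and `expected_numeric` and `not_numeric` are names introduced here, not part of the lab.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical two-row sample standing in for googleplaystore_cleaned.csv
sample = pd.DataFrame({
    "App": ["Coloring book moana", "Sketch - Draw & Paint"],
    "Reviews": [967.0, 215644.0],
    "Installs": [500000.0, 50000000.0],
    "Price": [0.0, 0.0],
})

expected_numeric = ["Installs", "Price", "Reviews"]

# Collect any column that is still stored as text instead of numbers
not_numeric = [c for c in expected_numeric if not is_numeric_dtype(sample[c])]
print(not_numeric)  # an empty list means all three columns are numeric
```

Running the same check against `df` would flag any column that still needs cleaning before scaling.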


### 2. Intro to Scikit-Learn
**What is Scikit-Learn?**
It is the standard library for Machine Learning in Python. We use it for:
1.  **Preprocessing:** Scaling numbers and encoding text.
2.  **Modeling:** Training algorithms.
3.  **Evaluation:** Checking accuracy.

**Task:** Import `sklearn` and check the version.

In [3]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler , OneHotEncoder

In [4]:
# Check the installed version via __version__
sklearn.__version__

'1.5.1'
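
The three roles listed in this section (preprocessing, modeling, evaluation) all follow the same `fit` / `transform` / `predict` pattern. The sketch below walks through them on synthetic data; `X_demo`, `y_demo`, and the choice of `LinearRegression` with mean absolute error are assumptions made for illustration, not the lab's prescribed model. For brevity it evaluates on the same data it trained on, which a real workflow would avoid.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic data: a noise-free linear target built from two random features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.0

# 1. Preprocessing: put both features on the same scale
X_scaled = StandardScaler().fit_transform(X_demo)

# 2. Modeling: train an algorithm on the scaled features
model = LinearRegression().fit(X_scaled, y_demo)

# 3. Evaluation: compare predictions with the true values
mae = mean_absolute_error(y_demo, model.predict(X_scaled))
print(f"MAE: {mae:.6f}")
```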

### 3. Train/Test Split
**Concept:** We hold out part of the data so that overfitting can be detected. The model learns from the **Train** set and is evaluated on the **Test** set, which it has never seen.

**Task:** 
1. Define `x` (features: every column except `Rating` and `App`) and `y` (target: `Rating`).
2. Split the data (80% Train, 20% Test).

In [5]:
x = df.drop(['Rating', 'App'], axis=1)
y = df['Rating']

In [6]:
# train_test_split: hold out 20% of the rows; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
print(x_train.shape)
print(x_test.shape)

(8276, 12)
(2070, 12)
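
The row counts above follow directly from `test_size=0.2`. The sketch below reproduces them on placeholder arrays with the same number of rows (10346, the sum of the two shapes printed above) and also shows that a fixed `random_state` yields the same split every time. `X_demo`, `y_demo`, and the `a_*` / `b_*` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 10346  # 8276 + 2070, taken from the shapes printed above
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y_demo = np.arange(n_rows)                 # placeholder target

a_train, a_test, b_train, b_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Repeating the call with the same seed gives the identical split
a_train2, a_test2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, a_test2))
```

Every row lands in exactly one of the two sets, so the test rows are never seen during training.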


### 4. 📏 Scaling Numerical Data (StandardScaler)
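**Concept:** `Installs` (in the millions) is on a far larger scale than a feature like `Price`. `StandardScaler` rescales each numerical column to zero mean and unit variance so that no feature dominates simply because of its units.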

**Task:** Use `StandardScaler` on the numerical columns.

In [7]:
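# StandardScaler(): standardize each column to zero mean and unit variance
num_col = ['Reviews', 'Size', 'Installs', 'Price']
scaler = StandardScaler()
# fit_transform() learns each column's mean/std and applies the scaling in one step.
# Note: this fits on the full DataFrame; in a real workflow, fit on
# x_train[num_col] and only transform the test set to avoid leaking test data.
x_scaled = scaler.fit_transform(df[num_col])
x_scaled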

array([[-0.1505439 , -0.17642234, -0.17642234, -0.06333855],
       [-0.15024442, -0.17031848, -0.17031848, -0.06333855],
       [-0.11816828, -0.11426258, -0.11426258, -0.06333855],
       ...,
       [-0.15060172, -0.17653445, -0.17653445, -0.06333855],
       [-0.15056058, -0.17653445, -0.17653445, -0.06333855],
       [-0.00297503, -0.05197826, -0.05197826, -0.06333855]])
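
The cell above scales the whole DataFrame at once. A common way to keep test data out of the scaling statistics is to fit on the training split and only transform the test split. The sketch below shows that pattern on synthetic data; `demo`, `demo_train`, `demo_test`, and the random values are assumptions for illustration, not the lab's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numerical columns on very different scales
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Reviews": rng.integers(0, 1_000_000, size=200).astype(float),
    "Price": rng.uniform(0, 20, size=200),
})
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=1)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(demo_train)  # learn mean/std from train only
test_scaled = scaler.transform(demo_test)        # reuse the train statistics

print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
```

The training columns come out with mean 0 and standard deviation 1. The test columns will be close to that but not exact, which is expected because their statistics were never used.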

### 5. 🔠 Encoding Categorical Data
**Concept:** Models need numbers, not text like "Business" or "Teen".

**Method A: Pandas `get_dummies` (Simple)**

In [8]:
#get_dummies
dummies = pd.get_dummies(x_train['Content Rating'])
dummies

Unnamed: 0,Adults only 18+,Everyone,Everyone 10+,Mature 17+,Teen
9296,False,True,False,False,False
9562,False,False,True,False,False
2844,False,True,False,False,False
4410,False,True,False,False,False
255,False,True,False,False,False
...,...,...,...,...,...
2895,False,True,False,False,False
7813,False,True,False,False,False
905,False,True,False,False,False
5192,False,True,False,False,False


**Method B: Sklearn `OneHotEncoder` (Professional)**

In [9]:
# OneHotEncoder: fit_transform learns the categories and encodes them in one step
encoder = OneHotEncoder(handle_unknown='ignore')
cat_encoder = encoder.fit_transform(x_train[['Category']])
cat_encoder.shape

(8276, 33)
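
One practical difference between the two methods: `get_dummies` builds its columns from whatever values appear in the frame it is given, while a fitted `OneHotEncoder` remembers the training categories. The `get_dummies` output above has five `Content Rating` columns, so a value missing from the training split would produce a different layout elsewhere. The sketch below illustrates this with tiny hypothetical frames (`train`, `test`), where `"Unrated"` appears only in `test`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frames: "Unrated" never appears in the training data
train = pd.DataFrame({"Content Rating": ["Everyone", "Teen", "Everyone"]})
test = pd.DataFrame({"Content Rating": ["Teen", "Unrated"]})

# get_dummies: each frame gets its own, possibly different, set of columns
train_dummies = pd.get_dummies(train["Content Rating"])
test_dummies = pd.get_dummies(test["Content Rating"])
print(list(train_dummies.columns), list(test_dummies.columns))

# OneHotEncoder: the column layout is fixed by the training categories;
# handle_unknown="ignore" encodes an unseen value as an all-zero row
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
test_encoded = enc.transform(test).toarray()
print(test_encoded)
```

Because the encoder's output always has the same columns, a model trained on the encoded training data can consume the encoded test data without realignment.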

### 6. 🚀 The Full Pipeline: ColumnTransformer
**Concept:** Instead of doing steps 4 and 5 manually, we wrap them in one object.

**Task:** Create a `ColumnTransformer` that scales the numerical columns and encodes the categorical columns in a single step.

In [15]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [16]:
numeric_features = ['Reviews', 'Size', 'Installs', 'Price']
categorical_features = ['Category', 'Content Rating']

In [17]:
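# Create ColumnTransformer: each entry is (name, transformer, columns)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        # handle_unknown='ignore' keeps transform() from failing on unseen categories
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)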

In [18]:
# Reuse the preprocessor defined above instead of rebuilding it inline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])


In [19]:
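# Keep only the columns the pipeline knows how to handle
x = df[numeric_features + categorical_features]
# Fit the scaler and encoder, then transform, in a single call
x_pro = pipeline.fit_transform(x)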

In [20]:
x.shape

(10346, 6)

In [21]:
x_pro.shape

(10346, 43)
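
The cells above fit the pipeline on the full DataFrame. A common alternative is to fit on the training split and then call `transform` on the test split, so both end up with the same columns. The sketch below shows that pattern under assumptions: `demo` is a small hypothetical frame (its values are illustrative, not from the dataset), and `demo_pipeline` mirrors the lab's pipeline with `handle_unknown='ignore'` added.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature dataset with two numeric and one categorical column
demo = pd.DataFrame({
    "Reviews": [159.0, 967.0, 87510.0, 215644.0, 967.0,
                120.0, 5000.0, 42.0, 310.0, 77.0],
    "Price": [0.0, 0.0, 0.0, 0.0, 0.0, 4.99, 0.0, 1.99, 0.0, 0.0],
    "Category": ["ART_AND_DESIGN", "TOOLS", "TOOLS", "ART_AND_DESIGN", "TOOLS",
                 "BUSINESS", "TOOLS", "BUSINESS", "ART_AND_DESIGN", "TOOLS"],
})
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=1)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Reviews", "Price"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),
    ])),
])

train_out = demo_pipeline.fit_transform(demo_train)  # learn statistics and categories
test_out = demo_pipeline.transform(demo_test)        # apply them unchanged

print(train_out.shape, test_out.shape)
```

Both outputs share the same number of columns, because the layout is decided once during `fit_transform` on the training rows.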

In [23]:
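# Recover the generated column names (prefixed with 'num__' / 'cat__')
new_col = pipeline.named_steps['preprocessor'].get_feature_names_out()
# x_pro is a sparse matrix here, so convert it to a dense array first
df_pro = pd.DataFrame(x_pro.toarray(), columns=new_col)
df_pro.head()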

Unnamed: 0,num__Reviews,num__Size,num__Installs,num__Price,cat__Category_ART_AND_DESIGN,cat__Category_AUTO_AND_VEHICLES,cat__Category_BEAUTY,cat__Category_BOOKS_AND_REFERENCE,cat__Category_BUSINESS,cat__Category_COMICS,...,cat__Category_TOOLS,cat__Category_TRAVEL_AND_LOCAL,cat__Category_VIDEO_PLAYERS,cat__Category_WEATHER,cat__Content Rating_Adults only 18+,cat__Content Rating_Everyone,cat__Content Rating_Everyone 10+,cat__Content Rating_Mature 17+,cat__Content Rating_Teen,cat__Content Rating_Unrated
0,-0.150544,-0.176422,-0.176422,-0.063339,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,-0.150244,-0.170318,-0.170318,-0.063339,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,-0.118168,-0.114263,-0.114263,-0.063339,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,-0.070677,0.446296,0.446296,-0.063339,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,-0.150244,-0.175301,-0.175301,-0.063339,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
