
<pre>
<center><b><h1>Machine Learning - 2301CS621</h1></b></center>
<center><b><h1>Lab - 3</h1></b></center>
</pre>

# 📱 Lab: Scikit-Learn Fundamentals (Google Play Store)

**Objective:** Transition from manual data cleaning to automated Machine Learning preprocessing using Scikit-Learn.

**Prerequisites:**
* Ensure you have the `googleplaystore_cleaned.csv` file (from the previous lab) in this folder.

### 1. Load Preprocessed Data
**Instruction:** Load the dataset you cleaned in the previous lab. This dataset should already have `Installs`, `Price`, and `Reviews` converted to numbers.

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv('googleplaystore_cleaned.csv')
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000000.0,10000.0,Free,0.0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000000.0,500000.0,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700000.0,5000000.0,Free,0.0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000000.0,50000000.0,Free,0.0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800000.0,100000.0,Free,0.0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
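
Before moving on, it can help to confirm that the columns really are numeric. The sketch below uses a small hypothetical sample (`sample`, not the real CSV) and pandas' `is_numeric_dtype` helper; in the lab you would run the same check on `df`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical two-row sample standing in for googleplaystore_cleaned.csv
sample = pd.DataFrame({
    "App": ["Demo App A", "Demo App B"],
    "Reviews": [159, 967],
    "Installs": [10000.0, 500000.0],
    "Price": [0.0, 4.99],
})

expected_numeric = ["Reviews", "Installs", "Price"]

# Collect every expected column that pandas does not store as a number
not_numeric = [col for col in expected_numeric if not is_numeric_dtype(sample[col])]
print("Columns still stored as text:", not_numeric)
```

An empty list means the cleaning from the previous lab carried over correctly.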


### 2. Intro to Scikit-Learn
**What is Scikit-Learn?**
It is the most widely used general-purpose library for Machine Learning in Python. We use it for:
1.  **Preprocessing:** Scaling numbers and encoding text.
2.  **Modeling:** Training algorithms.
3.  **Evaluation:** Measuring how well a trained model performs (for example, accuracy or error).
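
The three stages above can be sketched on synthetic data as follows. The data, the choice of `LinearRegression`, and the use of mean absolute error are assumptions made for illustration only; they are not part of this lab's tasks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2)) * [1.0, 1000.0]
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# 1. Preprocessing: put both features on a comparable scale
X_scaled = StandardScaler().fit_transform(X_demo)

# 2. Modeling: train an algorithm on the scaled features
model = LinearRegression().fit(X_scaled, y_demo)

# 3. Evaluation: measure the error (here on the same data, for brevity;
#    a real evaluation would use a held-out test set)
mae = mean_absolute_error(y_demo, model.predict(X_scaled))
print(f"MAE on the demo data: {mae:.3f}")
```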

**Task:** Import `sklearn` and check the version.

In [9]:
import sklearn

# The installed version is exposed through the __version__ attribute
print(sklearn.__version__)

1.5.1


### 3. Train-Test Split
**Concept:** We split the data so that we can detect "Overfitting". The model learns from the **Train** set and is evaluated on the **Test** set, which it has never seen during training.

**Task:** 
1. Define `X` (Features: everything except Rating/App) and `y` (Target: Rating).
2. Split the data (80% Train, 20% Test).

In [14]:
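from sklearn.model_selection import train_test_split

# Features: every column except the target (Rating) and the app name
X = df.drop(columns=['Rating', 'App'])
y = df['Rating']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)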

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

X_train shape: (7056, 11)
X_test shape: (1765, 11)
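
To see why the held-out set matters, here is a minimal sketch on synthetic data. The noisy sine data and the `DecisionTreeRegressor` are assumptions chosen because an unrestricted tree memorises its training rows; neither is part of the lab dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (assumption): a sine curve plus random noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unrestricted tree can fit every training row, noise included
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)  # R^2 on rows the model has seen
test_r2 = tree.score(X_te, y_te)   # R^2 on rows the model has never seen
print(f"Train R^2: {train_r2:.2f}")
print(f"Test  R^2: {test_r2:.2f}")
```

A large gap between the two scores is the usual sign of overfitting, and it is only visible because some rows were held back.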


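### 4. 📏 Scaling Numerical Data (StandardScaler)
**Concept:** `Installs` (up to millions) takes far larger values than `Price` (usually a few dollars). We rescale each numerical feature to zero mean and unit variance so the model treats them equally, instead of favouring a feature just because its numbers are bigger.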

**Task:** Use `StandardScaler` on the numerical columns.

In [None]:
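# StandardScaler rescales each column to zero mean and unit variance.
# fit_transform() learns the mean and standard deviation from the
# training set and applies the scaling in a single step.

from sklearn.preprocessing import StandardScaler
num_cols = ['Reviews', 'Size', 'Installs', 'Price']

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[num_cols])

print("Scaled data sample:")
X_train_scaled

Scaled data sample: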


array([[-0.04165443, -0.65937412,  0.07693281, -0.06729974],
       [-0.14321967, -0.82894166, -0.15725882, -0.06729974],
       [-0.14237874, -0.49873119, -0.1549285 , -0.06729974],
       ...,
       [-0.14318712,  0.17061436, -0.15724711, -0.06729974],
       [-0.03250245,  1.59855152, -0.04016886, -0.06729974],
       [-0.1007897 ,  0.30448346, -0.1338502 , -0.06729974]])
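
The cell above scales only the training set. A common convention is to reuse the same fitted scaler on the test set with `transform()`, so the test rows never influence the learned mean and standard deviation. The sketch below shows this on synthetic arrays; the names `demo_train`, `demo_test`, and `demo_scaler` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins (assumption): one large-valued and one small-valued column
rng = np.random.default_rng(0)
demo_train = np.column_stack([rng.uniform(0, 1e6, 500), rng.uniform(0, 10, 500)])
demo_test = np.column_stack([rng.uniform(0, 1e6, 100), rng.uniform(0, 10, 100)])

demo_scaler = StandardScaler()

# fit_transform: learn mean_ and scale_ from the training rows, then scale them
train_scaled = demo_scaler.fit_transform(demo_train)

# transform: reuse the training statistics on the test rows (no re-fitting)
test_scaled = demo_scaler.transform(demo_test)

print("Train column means:", train_scaled.mean(axis=0).round(6))
print("Train column stds: ", train_scaled.std(axis=0).round(6))
```

The training columns end up with mean close to 0 and standard deviation close to 1. The test columns will be near those values but not exactly, because they are scaled with the training statistics.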

### 5. 🔠 Encoding Categorical Data
**Concept:** Models need numbers, not text like "Business" or "Teen".

**Method A: Pandas `get_dummies` (Simple)**

In [22]:
# pd.get_dummies() creates one True/False column per category value
dummies = pd.get_dummies(X_train['Content Rating'])

dummies.head()

Unnamed: 0,Adults only 18+,Everyone,Everyone 10+,Mature 17+,Teen,Unrated
3254,False,True,False,False,False,False
4353,False,True,False,False,False,False
786,False,False,True,False,False,False
6149,False,False,False,False,True,False
449,False,False,False,False,True,False
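
One limitation of `get_dummies` is that it only creates columns for the values it sees, so train and test sets encoded separately can end up with different columns. The sketch below uses two small hypothetical Series to show the mismatch and one possible fix with `reindex`.

```python
import pandas as pd

# Hypothetical train/test values: "Unrated" appears only in the test part
train_part = pd.Series(["Everyone", "Teen", "Everyone"], name="Content Rating")
test_part = pd.Series(["Everyone", "Unrated"], name="Content Rating")

train_dummies = pd.get_dummies(train_part)
test_dummies = pd.get_dummies(test_part)

print("Train columns:", list(train_dummies.columns))
print("Test columns: ", list(test_dummies.columns))

# Align the test columns to the training columns:
# unseen categories are dropped, missing ones are filled with False
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=False)
print("Aligned test columns:", list(test_aligned.columns))
```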


**Method B: Sklearn `OneHotEncoder` (Professional)**

In [27]:
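# OneHotEncoder learns the categories with fit() and encodes them with transform();
# fit_transform() does both in one call.

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()

# The encoder returns a sparse matrix; toarray() converts it to a dense NumPy array
cat_encoded = encoder.fit_transform(X_train[['Category']]).toarray()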

print(f"Encoded shape: {cat_encoded.shape}")

Encoded shape: (7056, 33)


### 6. 🚀 The Full Pipeline: ColumnTransformer
**Concept:** Instead of doing steps 4 and 5 manually, we wrap them in one object.

**Task:** Create a `ColumnTransformer` that scales the numerical columns and encodes the categorical columns in a single step.

In [33]:
from sklearn.compose import ColumnTransformer

In [34]:
numeric_features = ['Reviews', 'Size', 'Installs', 'Price']
categorical_features = ['Category', 'Content Rating']

In [36]:
# Create ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ]
)
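
As a sanity check, the sketch below compares the manual route (scale and encode separately, then stack the results side by side) with a `ColumnTransformer` on a small hypothetical frame. The frame `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical four-row frame with two numeric columns and one categorical column
demo = pd.DataFrame({
    "Reviews": [159, 967, 87510, 215644],
    "Price": [0.0, 0.0, 4.99, 0.0],
    "Category": ["ART_AND_DESIGN", "TOOLS", "GAME", "TOOLS"],
})

# Manual route: scale and encode separately, then join the arrays column-wise
manual_num = StandardScaler().fit_transform(demo[["Reviews", "Price"]])
manual_cat = OneHotEncoder().fit_transform(demo[["Category"]]).toarray()
manual = np.hstack([manual_num, manual_cat])

# Combined route: one object applies each transformer to its own columns
ct = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Reviews", "Price"]),
    ("cat", OneHotEncoder(), ["Category"]),
])
combined = ct.fit_transform(demo)

# The result may be sparse or dense depending on how many zeros it contains
if hasattr(combined, "toarray"):
    combined = combined.toarray()

print("Same result:", np.allclose(manual, combined))
```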

In [37]:
from sklearn.pipeline import Pipeline
pipeline=Pipeline(steps=[('preprocessor',preprocessor)])

In [38]:
from sklearn import set_config
set_config(display='diagram')
pipeline

In [42]:
X = df[numeric_features + categorical_features]
X_preprocessed = pipeline.fit_transform(X)
print(f"Original Shape: {X.shape}")
print(f"Preprocessed Shape: {X_preprocessed.shape}")

Original Shape: (8821, 6)
Preprocessed Shape: (8821, 43)
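
The cell above fits the pipeline on the whole dataset to inspect the output shape. When a model is involved, a common convention is to fit on the training split only and then call `transform()` on the test split. Below is a minimal sketch of that pattern on a hypothetical ten-row frame. `handle_unknown="ignore"` is an added assumption so that categories missing from the training split do not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical ten-row frame standing in for the Play Store data
demo = pd.DataFrame({
    "Reviews": [159, 967, 87510, 215644, 967, 120, 4500, 30000, 75, 640],
    "Price": [0.0, 0.0, 0.0, 0.0, 0.0, 1.99, 0.0, 4.99, 0.0, 0.0],
    "Category": ["TOOLS", "GAME", "TOOLS", "GAME", "TOOLS",
                 "GAME", "TOOLS", "GAME", "WEATHER", "TOOLS"],
})
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=42)

demo_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Reviews", "Price"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"]),
    ])),
])

# Learn the scaling statistics and categories from the training rows only
train_out = demo_pipeline.fit_transform(demo_train)

# Apply the already-fitted steps to the test rows
test_out = demo_pipeline.transform(demo_test)

print("Train:", train_out.shape)
print("Test: ", test_out.shape)
```

Both outputs have the same number of columns because the test rows are encoded with the categories learned from the training rows.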


In [40]:
new_columns=pipeline.named_steps['preprocessor'].get_feature_names_out()
df_preprocessed=pd.DataFrame(X_preprocessed.toarray(),columns=new_columns)
df_preprocessed.head()

Unnamed: 0,num__Reviews,num__Size,num__Installs,num__Price,cat__Category_ART_AND_DESIGN,cat__Category_AUTO_AND_VEHICLES,cat__Category_BEAUTY,cat__Category_BOOKS_AND_REFERENCE,cat__Category_BUSINESS,cat__Category_COMICS,...,cat__Category_TOOLS,cat__Category_TRAVEL_AND_LOCAL,cat__Category_VIDEO_PLAYERS,cat__Category_WEATHER,cat__Content Rating_Adults only 18+,cat__Content Rating_Everyone,cat__Content Rating_Everyone 10+,cat__Content Rating_Mature 17+,cat__Content Rating_Teen,cat__Content Rating_Unrated
0,-0.146291,-0.102315,-0.154446,-0.066729,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,-0.145787,-0.324102,-0.142936,-0.066729,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,-0.09179,-0.559195,-0.037234,-0.066729,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,-0.011842,0.163829,1.019785,-0.066729,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,-0.145787,-0.820903,-0.152332,-0.066729,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
