
<pre>
<center><h1><b>Machine Learning</b></h1></center>

<center><h1><b>Lab - 3</b></h1></center>
</pre>

# 📱 Lab: Scikit-Learn Fundamentals (Google Play Store)

**Objective:** Transition from manual data cleaning to automated Machine Learning preprocessing using Scikit-Learn.

**Prerequisites:**
* Ensure you have the `googleplaystore_cleaned.csv` file (from the previous lab) in this folder.

In [1]:
import pandas as pd

### 1. Load Preprocessed Data
**Instruction:** Load the dataset you cleaned in the previous lab. This dataset should already have `Installs`, `Price`, and `Reviews` converted to numbers.

In [2]:
df = pd.read_csv('googleplaystore_cleaned.csv', sep=',')
df.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19000000.0 | 10000 | Free | 0.0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14000000.0 | 500000 | Free | 0.0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8700000.0 | 5000000 | Free | 0.0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25000000.0 | 50000000 | Free | 0.0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2800000.0 | 100000 | Free | 0.0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
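Before moving on, it can help to confirm that the columns really are numeric. The sketch below uses a small made-up DataFrame (`sample`, not the real CSV) as a stand-in; the column names mirror the lab, but the values and the helper names `expected_numeric` and `numeric_check` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (values are made up for illustration).
sample = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Rating": [4.1, 3.9, 4.7],
    "Reviews": [159, 967, 87510],
    "Installs": [10000, 500000, 5000000],
    "Price": [0.0, 0.0, 1.99],
})

# Columns the previous lab was supposed to convert to numbers.
expected_numeric = ["Reviews", "Installs", "Price"]

# Map each expected column to True/False depending on whether its dtype is numeric.
numeric_check = {
    col: pd.api.types.is_numeric_dtype(sample[col]) for col in expected_numeric
}
print(numeric_check)

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)
```

On the real data, the same two checks can be run against `df` directly.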


### 2. Intro to Scikit-Learn
**What is Scikit-Learn?**
It is the standard library for Machine Learning in Python. We use it for:
1.  **Preprocessing:** Scaling numbers and encoding text.
2.  **Modeling:** Training algorithms.
3.  **Evaluation:** Checking accuracy.
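As a rough illustration of how the three stages fit together, here is a minimal sketch on synthetic data. The choice of `LinearRegression` and `mean_absolute_error` is an assumption made for this example only; the lab itself does not prescribe a model or a metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic data: y is roughly 3 * x + 2 (not from the Play Store dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(100, 1))
y_demo = 3 * X_demo[:, 0] + 2 + rng.normal(0, 1, size=100)

# 1. Preprocessing: rescale the feature to mean 0 and standard deviation 1.
scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)

# 2. Modeling: train an algorithm on the preprocessed feature.
model_demo = LinearRegression()
model_demo.fit(X_scaled, y_demo)

# 3. Evaluation: compare predictions with the true values.
# (Evaluated on the same data here only to keep the sketch short.)
predictions = model_demo.predict(X_scaled)
mae = mean_absolute_error(y_demo, predictions)
print(f"Mean absolute error: {mae:.2f}")
```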

**Task:** Import `sklearn` and check the version.

In [3]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [4]:
# The installed version is stored in sklearn.__version__
print(sklearn.__version__)

1.7.2


### 3. Train-Test Split
**Concept:** We hold out part of the data so we can detect "overfitting". The model learns from the **Train** set and is evaluated on the **Test** set, which it never sees during training.

**Task:** 
1. Define `x` (Features: everything except `Rating` and `App`) and `y` (Target: `Rating`).
2. Split the data (80% Train, 20% Test).

In [5]:
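# Features: every column except the target (Rating) and the app name
x = df.drop(['Rating', 'App'], axis=1)
# Target: the value we want to predict
y = df['Rating']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)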

print(f"Training shape: {x_train.shape}")
print(f"Testing shape: {x_test.shape}")

Training shape: (8277, 11)
Testing shape: (2070, 11)
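To see what the split does on a small scale, the following sketch uses a made-up 10-row DataFrame (`toy`, with hypothetical columns `feature` and `target`). It illustrates two properties: `test_size=0.2` sends 2 of 10 rows to the test set, and a fixed `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows so the 80/20 split is easy to count by hand.
toy = pd.DataFrame({"feature": range(10), "target": range(10, 20)})
X_toy = toy[["feature"]]
y_toy = toy["target"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))  # 8 2

# Repeating the call with the same random_state gives the same test rows.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
same_split = X_te.index.equals(X_te2.index)

# No row appears in both the train and the test set.
overlap = set(X_tr.index) & set(X_te.index)
print(same_split, overlap)
```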


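### 4. 📏 Scaling Numerical Data (StandardScaler)
**Concept:** `Installs` (up to millions) sits on a far larger scale than `Price` (dollars). `StandardScaler` rescales each column to mean 0 and standard deviation 1 so that no feature dominates simply because of its units.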

**Task:** Use `StandardScaler` on the numerical columns.

In [6]:
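# Numerical columns to standardize
num_cols = ['Reviews','Size','Installs','Price']

# Learn each column's mean and standard deviation from the training data,
# then apply the scaling to that same data
scaler = StandardScaler()
x_trained_scaled = scaler.fit_transform(x_train[num_cols])

x_trained_scaled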

array([[-0.15153587, -0.8386026 , -0.17585395, -0.06144306],
       [-0.0922591 ,  3.11981093, -0.16337722, -0.06144306],
       [-0.15076336,  1.26430458, -0.1752313 , -0.06144306],
       ...,
       [-0.15153663, -0.76791665, -0.17585507, -0.06144306],
       [-0.14368972, -0.41448687, -0.16337722, -0.06144306],
       [-0.14980391,  0.95505353, -0.1746074 , -0.06144306]],
      shape=(8277, 4))
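The scaling applied above is z = (x - mean) / std, computed per column. The sketch below checks this by hand on a few made-up install counts (the arrays `train_installs` and `test_installs` are assumptions for illustration) and shows the usual pattern of fitting on the training data and reusing the same statistics on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up install counts; one column, four training rows and one test row.
train_installs = np.array([[1_000.0], [10_000.0], [100_000.0], [1_000_000.0]])
test_installs = np.array([[50_000.0]])

scaler_demo = StandardScaler()
# fit_transform learns the mean and std from the training data and scales it.
train_scaled = scaler_demo.fit_transform(train_installs)
# transform reuses the training mean and std; nothing is re-learned here.
test_scaled = scaler_demo.transform(test_installs)

# Manual z-score using the population standard deviation (ddof=0),
# which is what StandardScaler uses.
manual = (train_installs - train_installs.mean(axis=0)) / train_installs.std(axis=0)
matches_manual = np.allclose(train_scaled, manual)

print(train_scaled.ravel())
print(test_scaled.ravel())
print(matches_manual)
```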

### 5. 🔠 Encoding Categorical Data
**Concept:** Models need numbers, not text like "Business" or "Teen".

**Method A: Pandas `get_dummies` (Simple)**

In [7]:
# One-hot encode 'Content Rating' with pandas: one True/False column per category
dummies = pd.get_dummies(x_train['Content Rating'])
dummies


| | Adults only 18+ | Everyone | Everyone 10+ | Mature 17+ | Teen | Unrated |
|---|---|---|---|---|---|---|
| 5708 | False | True | False | False | False | False |
| 5838 | False | False | False | False | True | False |
| 8141 | False | False | False | False | True | False |
| 4046 | False | True | False | False | False | False |
| 2474 | False | True | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... |
| 5734 | False | True | False | False | False | False |
| 5191 | False | False | False | True | False | False |
| 5390 | False | True | False | False | False | False |
| 860 | False | False | False | False | True | False |
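One limitation of `get_dummies` is that it only creates columns for the categories present in the data it is given. The sketch below uses two made-up Series (`train_ratings` and `test_ratings`, hypothetical names) to show the train and test columns drifting apart, and one possible way to realign them with `reindex`.

```python
import pandas as pd

# Made-up values: 'Teen' appears only in train, 'Mature 17+' only in test.
train_ratings = pd.Series(["Everyone", "Teen", "Everyone"], name="Content Rating")
test_ratings = pd.Series(["Everyone", "Mature 17+"], name="Content Rating")

train_dummies = pd.get_dummies(train_ratings)
test_dummies = pd.get_dummies(test_ratings)
print(list(train_dummies.columns))  # ['Everyone', 'Teen']
print(list(test_dummies.columns))   # ['Everyone', 'Mature 17+']

# Force the test columns to match the training columns:
# missing columns are added as all-False, extra columns are discarded.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=False)
print(test_aligned)
```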


**Method B: Sklearn `OneHotEncoder` (Professional)**

In [8]:
from sklearn.preprocessing import OneHotEncoder

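# handle_unknown='ignore': a category not seen during fit is encoded as all zeros
encoder = OneHotEncoder(handle_unknown='ignore')

# Learn the categories from the training data and encode them (sparse matrix output)
cat_encoded = encoder.fit_transform(x_train[['Category']])
cat_encoded.shape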

(8277, 33)

### 6. 🚀 The Full Pipeline: ColumnTransformer
**Concept:** Instead of doing steps 4 and 5 manually, we wrap them in one object.

**Task:** Create a `ColumnTransformer` that scales the numerical columns and encodes the categorical columns in a single step.

In [9]:
from sklearn.compose import ColumnTransformer

In [10]:
numeric_features = ['Reviews', 'Size', 'Installs', 'Price']
categorical_features = ['Category', 'Content Rating']

In [11]:
# Create ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

In [12]:
from sklearn.pipeline import Pipeline

In [13]:
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

In [14]:
from sklearn import set_config
set_config(display='diagram')
pipeline

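**Pipeline**

| Parameter | Value |
|---|---|
| steps | `[('preprocessor', ...)]` |
| transform_input | `None` |
| memory | `None` |
| verbose | `False` |

**preprocessor: ColumnTransformer**

| Parameter | Value |
|---|---|
| transformers | `[('num', ...), ('cat', ...)]` |
| remainder | `'drop'` |
| sparse_threshold | `0.3` |
| n_jobs | `None` |
| transformer_weights | `None` |
| verbose | `False` |
| verbose_feature_names_out | `True` |
| force_int_remainder_cols | `'deprecated'` |

**num: StandardScaler**

| Parameter | Value |
|---|---|
| copy | `True` |
| with_mean | `True` |
| with_std | `True` |

**cat: OneHotEncoder**

| Parameter | Value |
|---|---|
| categories | `'auto'` |
| drop | `None` |
| sparse_output | `True` |
| dtype | `<class 'numpy.float64'>` |
| handle_unknown | `'ignore'` |
| min_frequency | `None` |
| max_categories | `None` |
| feature_name_combiner | `'concat'` |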


In [15]:
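# Select only the columns the preprocessor uses.
# Note: this takes every row of df, not just the training split, so the pipeline
# below is fit on the full dataset. A separate name avoids overwriting x_train.
x_selected = df[numeric_features + categorical_features]

In [16]:
x_processed = pipeline.fit_transform(x_selected)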

In [17]:
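# Recover the generated column names (prefixed with 'num__' or 'cat__')
new_columns = pipeline.named_steps['preprocessor'].get_feature_names_out()
# The result is a sparse matrix here, so convert it to a dense array first
df_processed = pd.DataFrame(x_processed.toarray(), columns=new_columns)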
df_processed.head()

| | num__Reviews | num__Size | num__Installs | num__Price | cat__Category_ART_AND_DESIGN | cat__Category_AUTO_AND_VEHICLES | cat__Category_BEAUTY | cat__Category_BOOKS_AND_REFERENCE | cat__Category_BUSINESS | cat__Category_COMICS | ... | cat__Category_TOOLS | cat__Category_TRAVEL_AND_LOCAL | cat__Category_VIDEO_PLAYERS | cat__Category_WEATHER | cat__Content Rating_Adults only 18+ | cat__Content Rating_Everyone | cat__Content Rating_Everyone 10+ | cat__Content Rating_Mature 17+ | cat__Content Rating_Teen | cat__Content Rating_Unrated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.150536 | -0.102315 | -0.176414 | -0.063335 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.150237 | -0.324102 | -0.170309 | -0.063335 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | -0.118159 | -0.559195 | -0.114251 | -0.063335 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.070666 | 0.163829 | 0.446334 | -0.063335 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -0.150237 | -0.820903 | -0.175292 | -0.063335 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
