
**Task 1: Data Preparation**

Description:

In this task, you will load the dataset and conduct an initial exploration. Handle missing values and, if necessary, convert categorical variables into numerical representations. Then split the dataset into training and testing sets for subsequent model evaluation.

In [1]:
import pandas as pd

**Load the dataset from CSV**

In [2]:
df = pd.read_csv("/content/Telco_Customer_Churn_Dataset  (3).csv")  # replace with your dataset path
print("Initial shape:", df.shape)
print(df.head())

Initial shape: (7043, 21)
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingM
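Beyond `head()`, a per-column summary of dtypes, missing counts, and unique counts can make the initial exploration more systematic. The sketch below runs on a small made-up frame rather than the real dataset, and `summarize` is a hypothetical helper name, not part of the original notebook. Note that a blank string is not counted as missing by `isna()`, which is why a text column can look complete at this stage.

```python
import pandas as pd

# Toy frame standing in for the Telco data (values are made up for illustration).
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "TotalCharges": ["29.85", "1889.5", " "],  # numbers stored as text, one blank
    "Churn": ["No", "No", "Yes"],
})

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and unique count for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
    })

summary = summarize(toy)
print(summary)
```

On the real data, the same helper could be called as `summarize(df)` right after loading.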

**Drop 'customerID': a unique identifier with no predictive value**

In [3]:
df = df.drop("customerID", axis=1)  # customerID is a unique identifier with no predictive value
print("Dropped 'customerID' column.")

Dropped 'customerID' column.


**Convert 'TotalCharges' to numeric and fill missing values**

In [4]:
# Convert 'TotalCharges' to numeric (coerce errors to NaN)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
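To see what `errors='coerce'` does, the following minimal sketch applies it to a made-up series in which one entry is a blank string. The values are illustrative, not taken from the dataset. Counting the NaNs before and after the conversion shows how many entries could not be parsed.

```python
import pandas as pd

# Made-up values: two parseable numbers and one blank string.
raw = pd.Series(["29.85", " ", "1889.5"])

# Entries that cannot be parsed as numbers become NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")

# New NaNs introduced by the conversion itself.
n_coerced = int(converted.isna().sum() - raw.isna().sum())

print(converted)
print("Values coerced to NaN:", n_coerced)
```

Running the same count on `df['TotalCharges']` before overwriting the column would show how many rows the later median fill affects.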

In [5]:
# Fill missing numeric values with median
numeric_cols = df.select_dtypes(include='number').columns.tolist()
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

In [7]:
# Fill missing categorical values with mode
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
print("Handled missing values.")

Handled missing values.
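The median and mode fills can be checked on a tiny example where the expected result is easy to work out by hand. This is a sketch on made-up data: the column names mirror the Telco dataset, but the values are assumptions. The median of 20, 80, and 50 is 50, and the most frequent contract is "Month-to-month".

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing numeric value and one missing category.
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, np.nan, 80.0, 50.0],
    "Contract": ["Month-to-month", None, "Month-to-month", "Two year"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.columns.difference(num_cols)

# Numeric columns: fill with the column median.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Categorical columns: fill with the most frequent value.
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

remaining = int(toy.isna().sum().sum())
print(toy)
print("Missing values remaining:", remaining)
```

The same final check, `df.isna().sum().sum()`, could confirm that no missing values are left in the real frame.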


**Convert categorical columns to numeric via one-hot encoding**

In [8]:
categorical_cols = [col for col in categorical_cols if col != 'Churn']  # exclude target

df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
print("Applied one-hot encoding to categorical variables.")

Applied one-hot encoding to categorical variables.
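The effect of `drop_first=True` is easiest to see on a single column. In this sketch (made-up rows, with a column name borrowed from the dataset), three categories yield two indicator columns, because the first category in sorted order becomes the implicit baseline.

```python
import pandas as pd

# Made-up column with three categories.
toy = pd.DataFrame({"InternetService": ["DSL", "Fiber optic", "No", "DSL"]})

# drop_first=True drops the indicator for the first category ("DSL"),
# so a row with all indicators false represents that baseline.
encoded = pd.get_dummies(toy, columns=["InternetService"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level per variable avoids perfectly collinear indicator columns, which matters mostly for linear models.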


**Encode target column 'Churn' to binary (Yes=1, No=0)**

In [9]:
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
print("Converted target column 'Churn' to 0/1.")

Converted target column 'Churn' to 0/1.
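One caveat with `map` is that any label missing from the dictionary silently becomes NaN. The sketch below uses a made-up series with one deliberately malformed label to show how such values could be detected. It illustrates a possible sanity check and is not something the original notebook performs.

```python
import pandas as pd

# Made-up labels; the last one is deliberately malformed.
labels = pd.Series(["Yes", "No", "yes "])

encoded = labels.map({"Yes": 1, "No": 0})

# Labels that were not in the mapping show up as NaN after map().
unmapped = labels[encoded.isna()]

print(encoded)
print("Unmapped labels:", unmapped.tolist())
```

On the real data, an empty result from `df['Churn'].isna().sum()` after the mapping would indicate that every label was either 'Yes' or 'No'.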


**Split data into training and testing sets**

In [10]:
X = df.drop('Churn', axis=1)
y = df['Churn']

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [12]:
print("Split data into train and test sets.")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

Split data into train and test sets.
X_train shape: (5634, 30)
X_test shape: (1409, 30)
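Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both splits. The following sketch checks this on synthetic data with an assumed 25% positive rate. The numbers are invented for illustration and are not the dataset's churn rate.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25 positives and 75 negatives.
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 25 + [0] * 75, name="Churn")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits should preserve the 25% positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()

print("Train positive rate:", train_rate)
print("Test positive rate:", test_rate)
```

The same comparison on the real split would be `y_train.mean()` against `y_test.mean()`.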
