# Project: Analyzing Customer Churn in India's Telecom Sector

## Objective
Analyze customer churn in India's telecom sector. Use pandas and scikit-learn to study datasets from top telecom firms, explore demographic and usage patterns, and predict customer retention. The project combines data analysis with predictive modeling to sharpen your data science skills.

## Project Instructions

1. **Load Data**
    - Load the two CSV files (`telecom_usage.csv` and `telecom_demographics.csv`) into separate DataFrames.
    - Merge them into a DataFrame named `churn_df`.

2. **Calculate Churn Rate and Identify Categorical Variables**
    - Calculate and print the churn rate.
    - Identify and print the categorical variables in `churn_df`.

3. **Feature Scaling**
    - Convert categorical features into numeric ones.
    - Perform feature scaling on the appropriate features.
    - Define your scaled features and target variable for the churn prediction model.

4. **Split Data**
    - Split the processed data into training and testing sets with an 80-20 split.
    - Name the sets `X_train`, `X_test`, `y_train`, and `y_test`.
    - Set a random state of 42 for reproducibility.

5. **Train Models**
    - Train Logistic Regression and Random Forest Classifier models.
    - Set a random seed of 42 for reproducibility.
    - Store model predictions in `logreg_pred` and `rf_pred`.

6. **Model Evaluation**
    - Assess the models on test data.
    - Assign the name of the model with the higher accuracy ("LogisticRegression" or "RandomForest") to `higher_accuracy`.

---

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

## Importing the Datasets

In [8]:
telecom_demo = pd.read_csv('telecom_demographics.csv')
telecom_usage = pd.read_csv('telecom_usage.csv')

## Merging Data

In [10]:
churn_df = pd.merge(telecom_demo, telecom_usage, on='customer_id')
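
`pd.merge` defaults to an inner join, so a customer who appears in only one of the two files is dropped without any message. The following is a minimal sketch of how the `validate` and `indicator` parameters could be used to check a join like this one. The `demo` and `usage` frames and all their values are invented for illustration and do not come from the project data.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; values are invented for illustration.
demo = pd.DataFrame({"customer_id": [1, 2, 3], "age": [25, 40, 31]})
usage = pd.DataFrame({"customer_id": [1, 2, 4], "calls_made": [10, 3, 7]})

# validate="one_to_one" raises MergeError if customer_id repeats on either side.
inner = pd.merge(demo, usage, on="customer_id", validate="one_to_one")

# An outer join with indicator=True shows which rows an inner join would drop.
outer = pd.merge(demo, usage, on="customer_id", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(len(inner))
print(sorted(unmatched["customer_id"].tolist()))
```

If `unmatched` is empty on the real data, the inner join loses nothing.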


## Calculate Churn Rate

In [11]:
churn_rate = churn_df['churn'].mean()
print(f'Churn Rate: {churn_rate:.2f}')

Churn Rate: 0.20
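
Taking the mean of the `churn` column gives the churn rate only if the column is coded as 0 (retained) and 1 (churned), which is assumed here. A small sketch on invented labels shows that the mean matches the share of 1s reported by `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up labels, assuming 1 = churned and 0 = retained.
churn = pd.Series([0, 0, 1, 0, 0, 0, 1, 0, 0, 0], name="churn")

rate_from_mean = churn.mean()                      # fraction of 1s
class_shares = churn.value_counts(normalize=True)  # share of each class

print(f"Churn rate: {rate_from_mean:.2f}")
print(class_shares.to_dict())
```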


## Identify Categorical Variables


In [12]:
categorical_vars = churn_df.select_dtypes(include=['object']).columns
print(f'Categorical Variables: {categorical_vars.tolist()}')

Categorical Variables: ['telecom_partner', 'gender', 'state', 'city', 'registration_event']
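
A dtype-based filter finds only columns stored as `object`, so a code-like column stored as an integer (such as `pincode`) is not reported. One-hot encoding also creates one column per distinct value, so it can help to look at cardinality before encoding. A rough sketch on invented rows, where only the column names mirror `churn_df`:

```python
import pandas as pd

# Invented sample rows; only the column names mirror churn_df.
toy = pd.DataFrame({
    "gender": pd.Series(["F", "M", "F", "M"], dtype=object),
    "city": pd.Series(["Delhi", "Mumbai", "Chennai", "Delhi"], dtype=object),
    "pincode": [110001, 400001, 600001, 110002],
})

object_cols = toy.select_dtypes(include=["object"]).columns.tolist()
# pincode is stored as an integer, so the dtype filter above does not pick it up.
cardinality = toy[object_cols + ["pincode"]].nunique()

print(object_cols)
print(cardinality.to_dict())
```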


## Feature Scaling

In [13]:
numeric_features = ['calls_made', 'sms_sent', 'data_used', 'age', 'num_dependents', 'estimated_salary']
categorical_features = ['telecom_partner', 'gender', 'state', 'city', 'pincode', 'registration_event']  # pincode is not an object column but is a code, so it is encoded as categorical


## Preprocessing and Scaling

In [23]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

## Target and Features

In [24]:
X = churn_df.drop(columns=['customer_id', 'churn'])
y = churn_df['churn']

## Split Data

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
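
The split above is purely random, so the churn share in the test set can drift from the overall rate. As an optional variation that the project instructions do not ask for, `train_test_split` accepts a `stratify` argument that keeps the class balance the same in both sets. A sketch on invented data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 invented samples with a 20% positive (churn) rate.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy preserves the 20/80 class balance in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```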

## Create Pipeline for Logistic Regression


In [26]:
logreg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('classifier', LogisticRegression(random_state=42))])



## Create Pipeline for Random Forest


In [27]:
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', RandomForestClassifier(random_state=42))])
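
Both pipelines hold the same `preprocessor` object. That is harmless here because both are fitted on the same training data, but a pipeline without caching appears to fit its steps in place, so refitting one pipeline on different data would also change the transformer used by the other. One possible way to decouple them is `sklearn.base.clone`, sketched below with shortened, hypothetical column lists and the illustrative names `base_preprocessor`, `logreg_pipe`, and `rf_pipe`:

```python
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists, shortened for the sketch.
base_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["calls_made"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ]
)

# clone() returns an unfitted copy with the same parameters.
logreg_pipe = Pipeline(steps=[
    ("preprocessor", clone(base_preprocessor)),
    ("classifier", LogisticRegression(random_state=42)),
])
rf_pipe = Pipeline(steps=[
    ("preprocessor", clone(base_preprocessor)),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Each pipeline now owns its own transformer instance.
shares_instance = (
    logreg_pipe.named_steps["preprocessor"] is rf_pipe.named_steps["preprocessor"]
)
print(shares_instance)
```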


## Train Models

In [28]:
logreg_pipeline.fit(X_train, y_train)
rf_pipeline.fit(X_train, y_train)

## Predictions

In [29]:
logreg_pred = logreg_pipeline.predict(X_test)
rf_pred = rf_pipeline.predict(X_test)

## Evaluate Models

In [30]:
logreg_accuracy = accuracy_score(y_test, logreg_pred)
rf_accuracy = accuracy_score(y_test, rf_pred)

In [31]:
print(f'Logistic Regression Accuracy: {logreg_accuracy:.2f}')
print(f'Random Forest Accuracy: {rf_accuracy:.2f}')


Logistic Regression Accuracy: 0.79
Random Forest Accuracy: 0.79
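
With a churn rate of about 0.20, a model that never predicts churn would score roughly 0.80 accuracy on data with the same class balance, so an accuracy of 0.79 alone says little about how well churners are identified. The sketch below uses invented labels (not the project's predictions) to show how a majority-class baseline, recall, and a confusion matrix could be checked alongside accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented labels: 2 churners out of 10 customers.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A trivial "model" that never predicts churn.
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
churn_recall = recall_score(y_true, y_majority)  # share of churners caught
matrix = confusion_matrix(y_true, y_majority)    # rows: actual, cols: predicted

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Recall on churners: {churn_recall:.2f}")
print(matrix)
```

On the real data, the same calls could be run with `y_test` and `logreg_pred` or `rf_pred` in place of the toy arrays.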


## Determining the Higher-Accuracy Model

In [32]:
higher_accuracy = 'LogisticRegression' if logreg_accuracy > rf_accuracy else 'RandomForest'  # a tie resolves to 'RandomForest'
print(f'Higher Accuracy Model: {higher_accuracy}')

Higher Accuracy Model: RandomForest
