# DATATHON@METUSTATCLUB Prequalification 2023!

### INTRODUCTION

In this datathon project you will be given an 18-month transaction dataset. You will start by dividing this dataset into 2 separate 9-month periods. You will create a churn model using the first 9-month period.

In the second step, you will create a churn variable. The first 9-month period will be used for this process. For example, users who were active in the first 6-month period will be identified and it will be determined whether these users churned in the last 3-month period. If there is a response imbalance, this problem should be solved before model creation.

Finally, using what you have learnt from the first 9-month period, you will use your churn model to predict whether the active customers in the dataset in the second 9-month period (for example, those active in the first 6 months of the second dataset) will churn in the last 3 months.

This project focuses on analysing the transaction dataset and predicting the probability of customer churn. Doing so requires processing the data, building a model and interpreting the results.

#### STEPS:
- Read the documentation describing the dataset and the accompanying notes.
- Load the dataset and examine it.
- Analyse the size, columns, number of missing values and other statistical properties of the dataset.
- Divide the dataset into two separate 9-month periods.
- Create the churn variable: identify the users who are active in the first 6 months of the first 9-month period and determine whether they churned in the last 3 months.
- If the churn variable is imbalanced, apply an appropriate technique to correct the imbalance.
- Select an appropriate algorithm for the churn model and train it on the first 9-month period.
- Test the churn model and evaluate its performance.
- Load the second 9-month period and use the model to estimate the probability that users who were active in its first 6 months will churn in its last 3 months.
- Test your predictions and evaluate the model performance.
- Visualise and interpret the results.

-------------------

### The libraries to be used are imported.

In [69]:
import pandas as pd
import numpy as np
from dateutil.relativedelta import relativedelta
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

        # Classification Models
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, ExtraTreeRegressor
from xgboost import XGBClassifier


        # Deep Learning
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


        # Testing
from sklearn.metrics import accuracy_score, f1_score, recall_score

        # Standard Scaler
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore")

-------------------

### The data comes as three separate Excel files: Sales, Products and Customers. Each file is read into a DataFrame.

In [2]:
sales = pd.read_excel("sales23.xlsx")
customers = pd.read_excel("customers23.xlsx")
products = pd.read_excel("products23.xlsx")

-------------------

## EDA - Exploratory Data Analysis

- EDA is the process of understanding, exploring, and visualizing data. In this process, we clean and characterize the data, perform statistical analysis, visualize the data and interpret the results.

In [3]:
sales.head(1)

Unnamed: 0,TransactionID,UserID,DateTime,ProductID,Channel,PaymentType,Price,Discount
0,1,500546547,2017-01-01 01:40:39.180,10334,MOBILE,Cash,51.0,No


In [4]:
products.head(1)

Unnamed: 0,ProductID,Category
0,10001,Female Shoes


In [5]:
customers.head(1)

Unnamed: 0,UserID,UserFirstTransaction,Gender,Location,Age
0,500234532,2011-10-12,FEMALE,ANTALYA,19


-------------------

### Sales and Customers are first merged on their common UserID column. The result is then merged with Products on the common ProductID column.

In [6]:
merged_df = pd.merge(sales, customers, on='UserID')
merged_df.head(1)

Unnamed: 0,TransactionID,UserID,DateTime,ProductID,Channel,PaymentType,Price,Discount,UserFirstTransaction,Gender,Location,Age
0,1,500546547,2017-01-01 01:40:39.180,10334,MOBILE,Cash,51.0,No,2015-03-18,FEMALE,ANKARA,30


In [7]:
df = pd.merge(merged_df, products, on='ProductID')
df.head(1)

Unnamed: 0,TransactionID,UserID,DateTime,ProductID,Channel,PaymentType,Price,Discount,UserFirstTransaction,Gender,Location,Age,Category
0,1,500546547,2017-01-01 01:40:39.180,10334,MOBILE,Cash,51.0,No,2015-03-18,FEMALE,ANKARA,30,Female Shoes
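
A merge can silently duplicate or drop rows when a key is repeated or missing on one side. The sketch below shows one way to guard against that with the `validate` argument of `DataFrame.merge`. The three small tables and their values are made up for illustration and only mimic the column names used here.

```python
import pandas as pd

# Toy stand-ins for the sales, customers and products tables (values are made up).
sales_demo = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "UserID": [10, 10, 11],
    "ProductID": [100, 101, 100],
})
customers_demo = pd.DataFrame({"UserID": [10, 11], "Gender": ["FEMALE", "MALE"]})
products_demo = pd.DataFrame({"ProductID": [100, 101], "Category": ["Hobbies", "Male Shoes"]})

# validate="many_to_one" raises MergeError if a key is duplicated on the right side,
# so a duplicated customer or product cannot silently multiply sales rows.
merged_demo = (
    sales_demo
    .merge(customers_demo, on="UserID", how="inner", validate="many_to_one")
    .merge(products_demo, on="ProductID", how="inner", validate="many_to_one")
)

# An inner join drops sales whose key has no match; comparing row counts reveals that.
rows_lost = len(sales_demo) - len(merged_demo)
print(merged_demo)
print("rows lost in merge:", rows_lost)
```

If `rows_lost` is not zero, some transactions refer to a user or product that is absent from the lookup table and deserve a closer look before modelling.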


-------------------

In [8]:
df.columns

Index(['TransactionID', 'UserID', 'DateTime', 'ProductID', 'Channel',
       'PaymentType', 'Price', 'Discount', 'UserFirstTransaction', 'Gender',
       'Location', 'Age', 'Category'],
      dtype='object')

In [9]:
df["Channel"].unique()

array(['MOBILE', 'WEB'], dtype=object)

In [10]:
df["PaymentType"].unique()

array(['Cash', 'Mobile Payment', 'Online Credit Card'], dtype=object)

In [11]:
df["Price"].unique()

array([ 51.   ,  21.   ,  30.   , ...,  34.425, 258.6  , 400.65 ])

In [12]:
df["Discount"].unique()

array(['No', 'Yes'], dtype=object)

In [13]:
df["Gender"].unique()

array(['FEMALE', 'MALE'], dtype=object)

In [14]:
df["Location"].unique()

array(['ANKARA', 'TRABZON', 'ESKISEHIR', 'KAYSERI', 'IZMIR', 'ANTALYA',
       'ISTANBUL', 'BURSA', 'ADANA'], dtype=object)

In [15]:
df["Age"].unique()

array([30, 28, 26, 41, 34, 43, 38, 31, 36, 32, 44, 33, 39, 21, 29, 23, 49,
       40, 22, 37, 48, 46, 50, 24, 19, 20, 27, 25, 35, 47, 42, 45],
      dtype=int64)

In [16]:
df["Category"].unique()

array(['Female Shoes', 'Female Fashion', 'Sport Shoes', 'Smart Phones',
       'Electronic Accessories', 'Kitchen Electronics',
       'Computers & Laptops', 'TVs and TV Sets', 'Male Shoes',
       'Outdoor Sports', 'Hobbies', 'Male Fashion', 'Sound Systems',
       'Smart Watches', 'Indoor Sports'], dtype=object)

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 69059 entries, 0 to 69058
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   TransactionID         69059 non-null  int64         
 1   UserID                69059 non-null  int64         
 2   DateTime              69059 non-null  datetime64[ns]
 3   ProductID             69059 non-null  int64         
 4   Channel               69059 non-null  object        
 5   PaymentType           69059 non-null  object        
 6   Price                 69059 non-null  float64       
 7   Discount              69059 non-null  object        
 8   UserFirstTransaction  69059 non-null  datetime64[ns]
 9   Gender                69059 non-null  object        
 10  Location              69059 non-null  object        
 11  Age                   69059 non-null  int64         
 12  Category              69059 non-null  object        
dtypes: datetime64[ns

In [18]:
df.isnull().sum()

TransactionID           0
UserID                  0
DateTime                0
ProductID               0
Channel                 0
PaymentType             0
Price                   0
Discount                0
UserFirstTransaction    0
Gender                  0
Location                0
Age                     0
Category                0
dtype: int64

 - No column contains missing values, so no imputation is needed.

In [19]:
df.corr(numeric_only=True)  # correlate the numeric columns only

Unnamed: 0,TransactionID,UserID,ProductID,Price,Age
TransactionID,1.0,-0.008126,0.00995,0.042132,-0.009846
UserID,-0.008126,1.0,-0.007742,-0.004037,0.052687
ProductID,0.00995,-0.007742,1.0,-0.012901,0.012371
Price,0.042132,-0.004037,-0.012901,1.0,0.061357
Age,-0.009846,0.052687,0.012371,0.061357,1.0


## Feature Engineering

- Feature engineering is the process of creating new features from the existing features in the dataset. Existing features are transformed or combined to make the data more meaningful and to improve model performance.

In [20]:
df["Discount"]= df["Discount"].map({'No':0,'Yes':1})
df["Discount"].unique()

array([0, 1], dtype=int64)

---------------------------------

### All the given data has been analysed and prepared for processing.

---------------------------------

# Question 1: You are given an 18-month transactional data set. Please split this data set into two 9-month periods.

To divide the 18-month transaction dataset into two 9-month periods, follow the steps below:

- Determine the time range of the dataset by identifying the smallest and largest dates in the dataset.
- Add 9 months to the smallest date. This is the cut-off date that ends the first period.
- For the first period, keep the transactions from the smallest date up to the cut-off date.
- For the second period, keep the transactions after the cut-off date up to the largest date.

#### Read the dataset and get the '**DateTime**' column in the correct format.

In [21]:
df['DateTime'] = pd.to_datetime(df['DateTime'])

---------------------------------

#### The smallest and largest dates in the dataset are determined.

In [22]:
min_date = df['DateTime'].min()
max_date = df['DateTime'].max()

- Find the minimum (**min_date**) and maximum (**max_date**) dates in the dataset. This step determines the time range of the dataset.

---------------------------------

#### To determine the first 9-month period, 9 months are added to the smallest date.

In [23]:
cutoff_date_9_months = min_date + pd.DateOffset(months=9)

- The cutoff date between two periods is determined by adding 9 months to the smallest date (**cutoff_date_9_months**).

---------------------------------

#### For the first period, the transactions from the smallest date to the cut-off date are filtered.

In [24]:
first_period = df[df['DateTime'] <= cutoff_date_9_months]

- A DataFrame named '**first_period**' is created containing all transactions up to the cut-off date. This DataFrame represents the first 9 months of the dataset.

---------------------------------

#### For the second period, filter the transactions from the cut-off date to the largest date

In [25]:
second_period = df[df['DateTime'] > cutoff_date_9_months]

- A DataFrame named '**second_period**' is created containing all transactions after the cut-off date. This DataFrame represents the second 9-month period of the dataset.

---------------------------------

#### The first-period and second-period datasets are checked.

In [26]:
first_period.head(1)

Unnamed: 0,TransactionID,UserID,DateTime,ProductID,Channel,PaymentType,Price,Discount,UserFirstTransaction,Gender,Location,Age,Category
0,1,500546547,2017-01-01 01:40:39.180,10334,MOBILE,Cash,51.0,0,2015-03-18,FEMALE,ANKARA,30,Female Shoes


In [27]:
second_period.head(1)

Unnamed: 0,TransactionID,UserID,DateTime,ProductID,Channel,PaymentType,Price,Discount,UserFirstTransaction,Gender,Location,Age,Category
3,33462,500338383,2017-11-05 15:09:01.390,10334,WEB,Mobile Payment,30.0,0,2014-06-19,FEMALE,TRABZON,28,Female Shoes


- By printing the first row of each of the two newly created DataFrames (**first_period** and **second_period**), we check that the dates fall on the expected side of the cut-off.

---------------------------------

**The above code splits the 18-month transaction dataset into two 9-month periods and stores them in two separate DataFrames. The same approach can be used for time series analysis or for studying customer behaviour over different periods.**
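
The split can also be wrapped in a small helper. The sketch below is one possible version on made-up data, and `split_periods` is a hypothetical name. It uses a half-open convention (`<` for the first period, `>=` for the second), which is a variation on the `<=` / `>` pair used above. Either choice works as long as every row lands in exactly one period.

```python
import pandas as pd

def split_periods(frame, date_col="DateTime", months=9):
    """Split a frame into [start, cutoff) and [cutoff, end] by calendar months."""
    start = frame[date_col].min()
    cutoff = start + pd.DateOffset(months=months)
    first = frame[frame[date_col] < cutoff]    # strictly before the cut-off
    second = frame[frame[date_col] >= cutoff]  # on or after the cut-off
    return first, second, cutoff

# Made-up transactions spanning roughly 15 months.
demo = pd.DataFrame({
    "DateTime": pd.to_datetime(["2017-01-01", "2017-06-15", "2017-10-01", "2018-03-20"]),
    "Price": [51.0, 21.0, 30.0, 12.5],
})
first_demo, second_demo, cutoff_demo = split_periods(demo)

# Every row must belong to exactly one of the two periods.
print(cutoff_demo, len(first_demo), len(second_demo))
```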

---------------------------------

---------------------------------

# Question 2: Create churn variable from dataset. You are supposed to use the first 9-month duration to construct your model. For example, you should use the first 6-month duration for the active users and use the last 3-month period to determine if these active users churn within this 3-month period. (Please resolve if there is imbalance in response before modeling.)

This question asks you to create a "churn" variable from an existing dataset. Churn refers to when a customer stops using a service or product. The question asks you to use the first 9 months in the dataset to build your model.

Let's explain the steps as follows:

- Using the 'DateTime' column, restrict the analysis to a 9-month period. The first 6 months are used to identify active users, and the last 3 months are used to determine whether these users churn.

- Identify the users who were active during the first 6 months, i.e. every user (UserID) who made at least one transaction in this period, and collect them in a list.

- Check the active users for churn in the last 3-month period. A user with no transactions in this period is considered churned.

- Add the churn values to the dataset as a new 'Churn' column, which contains 0 (no churn) or 1 (churn) for each user.


#### Read the dataset and get the date columns in the correct format.

In [28]:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['UserFirstTransaction'] = pd.to_datetime(df['UserFirstTransaction'])

---------------------------------

#### The first 6 months and the last 3 months are determined.

In [29]:
cutoff_date_6_months = df['DateTime'].min() + pd.DateOffset(months=6)
cutoff_date_9_months = df['DateTime'].min() + pd.DateOffset(months=9)

 - For the first 6 months, a cut-off date is set by adding 6 months to the minimum date. Likewise, another cut-off date is set for the 9-month period by adding 9 months to the minimum date.

---------------------------------

#### Active users in the first 6 months are determined.

In [30]:
active_users = df[df['DateTime'] <= cutoff_date_6_months]['UserID'].unique()

- The unique IDs of the users who made at least one transaction within the first 6 months are stored in an array called "**active_users**".

---------------------------------

#### For active users, it is checked whether they have made transactions in the last 3 months.

In [31]:
churn_users = []
for user in active_users:
    user_transactions = df[df['UserID'] == user]
    last_transaction = user_transactions['DateTime'].max()
    if last_transaction <= cutoff_date_6_months:
        churn_users.append(user)

- We loop over the active users and retrieve each user's transactions. If the user's last transaction date is on or before the 6-month cut-off date, the user is added to the "**churn_users**" list: the user made no purchase after the first 6 months and is considered churned. Because the lookup runs on the full **df**, "after the first 6 months" covers every later date in the dataset, not only months 7 to 9.
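
The same check can be written without a Python loop. The sketch below is one possible version on made-up data, not the code used for the results in this notebook. It counts a user as retained only if a transaction falls inside the 3-month window that follows the 6-month cut-off, so purchases made after month 9 do not change the label.

```python
import pandas as pd

# Made-up transactions for four users.
tx = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3, 4],
    "DateTime": pd.to_datetime([
        "2017-01-10", "2017-08-02",   # user 1: active, returns in months 7-9
        "2017-03-05",                 # user 2: active, never returns
        "2017-05-20", "2017-12-01",   # user 3: active, returns only after month 9
        "2017-08-15",                 # user 4: first seen in months 7-9, not "active"
    ]),
})

start = tx["DateTime"].min()
cutoff_6m = start + pd.DateOffset(months=6)
cutoff_9m = start + pd.DateOffset(months=9)

# Users with at least one transaction in the first 6 months.
active = pd.Index(tx.loc[tx["DateTime"] <= cutoff_6m, "UserID"].unique(), name="UserID")

# Users with at least one transaction in the following 3-month window.
in_window = (tx["DateTime"] > cutoff_6m) & (tx["DateTime"] <= cutoff_9m)
returned = tx.loc[in_window, "UserID"].unique()

# Churn = active in months 1-6 but absent from months 7-9.
churn = pd.Series((~active.isin(returned)).astype(int), index=active, name="Churn")
print(churn)
```

The result has one row per active user, which also makes it straightforward to build user-level features later instead of labelling every transaction row.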

---------------------------------

#### The Churn column is created.

In [32]:
df['Churn'] = np.where(df['UserID'].isin(churn_users), 1, 0)

- A new column called "**Churn**" is created. If the user is in the churn_users list, we assign the value 1 to this column; if not, we assign the value 0.

---------------------------------

#### The state of imbalance is checked.

In [33]:
churn_counts = df['Churn'].value_counts()
print(churn_counts)

0    67346
1     1713
Name: Churn, dtype: int64


- The counts above are transaction rows: 1,713 rows belong to users flagged as churned, against 67,346 rows for all other users. This is a clear class imbalance. A model trained on such data tends to predict the non-churn class accurately but may struggle to identify churners.
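
To see why accuracy alone is misleading under this imbalance, the short sketch below scores a hypothetical "model" that always predicts non-churn, using the class counts printed above. Only the counts come from this notebook; the always-zero predictor is an illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the value_counts() output above.
n_no_churn, n_churn = 67346, 1713

y_true = np.array([0] * n_no_churn + [1] * n_churn)
y_pred = np.zeros_like(y_true)  # a trivial predictor that never flags churn

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred, zero_division=0)

print(f"accuracy of always predicting 0: {baseline_accuracy:.4f}")
print(f"recall on the churn class:       {baseline_recall:.4f}")
```

The trivial predictor is right for roughly 97.5% of the rows while finding no churners at all, so any accuracy figure reported later should be read against this baseline.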

## To address this imbalance, we use **Oversampling** to balance the dataset.

SMOTE (Synthetic Minority Over-sampling Technique) is used to balance the classes in the training data.

In [34]:
x = df.drop(['Churn','TransactionID','UserID','DateTime','ProductID','UserFirstTransaction'],axis=1)
y = df['Churn']

- The Churn column is the target, so it is removed from the features. TransactionID, UserID and ProductID are identifiers, and DateTime and UserFirstTransaction are dates; these columns are dropped because they are not used as classification features.

In [35]:
x = pd.get_dummies(x,drop_first=True)

- The remaining object columns are nominal: their values have no natural order. The **get_dummies()** method one-hot encodes them so that no category is treated as larger than another, and **drop_first=True** drops the first level of each column to avoid redundant columns.
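
A small example makes the effect of `drop_first=True` visible. The three rows below are made up and reuse two of the categorical column names from this dataset.

```python
import pandas as pd

demo = pd.DataFrame({
    "Channel": ["MOBILE", "WEB", "MOBILE"],
    "PaymentType": ["Cash", "Mobile Payment", "Online Credit Card"],
    "Price": [51.0, 21.0, 30.0],
})

# Numeric columns pass through unchanged; each object column becomes
# one indicator column per level, minus the first (alphabetical) level.
encoded = pd.get_dummies(demo, drop_first=True)
print(encoded.columns.tolist())
```

`Channel_MOBILE` and `PaymentType_Cash` do not appear: they are the baseline levels, represented by zeros in all the remaining indicator columns of that feature.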

In [36]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [37]:
smote = SMOTE(random_state=42)
x_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)

In [38]:
pd.Series(y_train_resampled).value_counts()

0    53882
1    53882
Name: Churn, dtype: int64
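
SMOTE balances the classes by creating synthetic minority samples on the line segment between a minority point and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation idea on made-up 2-D points. It is a simplified illustration with a hypothetical helper name (`smote_like_samples`), not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority points.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))         # pick a minority point
        mate = neighbours[base, rng.integers(k)]   # pick one of its neighbours
        gap = rng.random()                         # position on the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[mate] - minority[base])
    return synthetic

minority_demo = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_points = smote_like_samples(minority_demo, n_new=6)
print(new_points.round(2))
```

Because the new points are interpolated, one-hot columns can receive fractional values. imbalanced-learn also provides `SMOTENC` for data that mixes numeric and categorical features, which may be worth considering for a feature matrix like this one.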

---------------------------------

---------------------------------

## Question 3: Once you learn from the first 9-month period, proceed with the second 9-month data. Use your churn model obtained in (2) to predict if active customers (for example, those who were active in the first 6-month period of the second data set) churn in the last 3 months of this duration.

This question asks us to predict whether customers in the second 9-month period will churn, using the churn pattern learnt in the first 9-month period. For this purpose, the following steps are followed.

- Identify the active users in the second 9-month period.
- Train the churn model (using the dataset previously balanced with SMOTE).
- Predict whether each active user will churn.

#### The 6-month activity window and the 3-month churn window are defined.

In [39]:
active_time_window = pd.DateOffset(months=6)
churn_time_window = pd.DateOffset(months=3)

---------------------------------

#### First 6 months of the second period

In [40]:
second_period_active_start = cutoff_date_9_months
second_period_active_end = second_period_active_start + active_time_window

---------------------------------

#### Last 3 months of the second period

In [41]:
second_period_churn_start = second_period_active_end
second_period_churn_end = second_period_churn_start + churn_time_window

---------------------------------

#### Active users in the second period

In [42]:
active_users_second_period = second_period[(second_period['DateTime'] >= second_period_active_start) & (second_period['DateTime'] <= second_period_active_end)]['UserID'].unique()

---------------------------------

**To determine the best method, a function was written that trains and tests several common classification algorithms at once and records their Accuracy, F1 and Recall scores in a DataFrame.**

In [43]:
def algo_test(x,y):
    gauss = GaussianNB()
    kneClas = KNeighborsClassifier()
    svc = SVC()
    bernoulli = BernoulliNB()
    randForestClas= RandomForestClassifier()
    gradBoodClas = GradientBoostingClassifier()
    logReg = LogisticRegression()
    decTreeClas = DecisionTreeClassifier()
    xboost = XGBClassifier()
    
    x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
    smote = SMOTE(random_state=42)
    x_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)
    
    algos = [gauss,kneClas,svc,bernoulli,randForestClas,gradBoodClas,logReg,decTreeClas,xboost]
    algo_names = ["GaussianNB","KNeighborsClassifier","SVC","BernoulliNB","RandomForestClassifier","GradientBoostingClassifier","LogisticRegression","DecisionTreeClassifier","XGBClassifier"]
    ac_sc = []
    f1_sc = []
    rec_sc = []
    
    result = pd.DataFrame(columns = ["Accuracy_Score","F1_Score","Recall_Score"],index = algo_names)
    
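    for algo in algos:
        algo.fit(x_train_resampled,y_train_resampled)
        y_pred = algo.predict(x_test)  # predict once and reuse for all three metrics
        # scikit-learn metrics are defined as metric(y_true, y_pred). Accuracy and F1
        # are unchanged by the order used here, but recall_score(y_pred, y_test)
        # equals the precision of the churn class.
        ac_sc.append(accuracy_score(y_pred,y_test))
        f1_sc.append(f1_score(y_pred,y_test))
        rec_sc.append(recall_score(y_pred,y_test))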
        
    result.Accuracy_Score =ac_sc
    result.F1_Score = f1_sc
    result.Recall_Score = rec_sc
    return result.sort_values("Accuracy_Score", ascending=False)

In [44]:
algo_test(x,y) # Using method

Unnamed: 0,Accuracy_Score,F1_Score,Recall_Score
RandomForestClassifier,0.977483,0.464716,0.579399
DecisionTreeClassifier,0.965175,0.405439,0.355748
XGBClassifier,0.957573,0.334091,0.276316
GradientBoostingClassifier,0.890892,0.128398,0.080377
LogisticRegression,0.848972,0.057814,0.034298
KNeighborsClassifier,0.839632,0.151666,0.087494
BernoulliNB,0.814871,0.059581,0.034163
SVC,0.782942,0.072975,0.040887
GaussianNB,0.366783,0.052643,0.027353
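
When reading a recall column, the argument order of the scikit-learn metric functions matters: they are defined as `metric(y_true, y_pred)`. The sketch below uses ten made-up labels to show that accuracy and F1 are unaffected by swapping the arguments, while recall and precision trade places.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 4 true churners, of which the "model" finds 1, plus 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

recall_correct = recall_score(y_true, y_pred)     # TP / (TP + FN) = 1 / 4
recall_swapped = recall_score(y_pred, y_true)     # arguments reversed
precision_value = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 2

print("recall (y_true, y_pred):", recall_correct)
print("recall (y_pred, y_true):", recall_swapped)
print("precision              :", precision_value)
print("F1 both orders         :", f1_score(y_true, y_pred), f1_score(y_pred, y_true))
print("accuracy both orders   :", accuracy_score(y_true, y_pred), accuracy_score(y_pred, y_true))
```

For churn problems recall on the churn class is usually the quantity of interest, so it is worth confirming which of the two numbers a results table actually holds.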


--------------------

## Since RandomForestClassifier() gives the highest accuracy and F1 score, we choose it for modelling.

In [63]:
model = RandomForestClassifier()
model.fit(x_train_resampled, y_train_resampled)

--------------------

#### Determine the feature columns.

In [73]:
feature_columns = [
    'TransactionID', 'DateTime', 'ProductID', 'Channel', 'PaymentType',
    'Price', 'Discount', 'UserFirstTransaction', 'Gender', 'Location',
    'Age', 'Category'
]

--------------------

#### The rows of users who are active in the second period are selected.

In [74]:
x_active_users_raw = df[df['UserID'].isin(active_users_second_period)]

--------------------

#### One-hot encoding is applied to the features of the active users in the second period.

In [75]:
x_active_users = pd.get_dummies(x_active_users_raw[feature_columns], drop_first=True)

--------------------

#### The column order is made the same as the training dataset.

In [76]:
x_active_users = x_active_users[x.columns]
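
Selecting `x.columns` works when the new data contains every category seen during training. If a category is missing, or if `drop_first=True` drops a different baseline level on the smaller subset, the columns no longer line up. The sketch below shows one possible way to align them with `reindex`, on made-up data. The new data is encoded without `drop_first`, and the training columns decide which indicators are kept.

```python
import pandas as pd

# Made-up training and prediction frames; the new data contains only one city.
train = pd.DataFrame({"Location": ["ANKARA", "IZMIR", "BURSA"], "Age": [30, 28, 41]})
new = pd.DataFrame({"Location": ["IZMIR", "IZMIR"], "Age": [26, 34]})

# Training encoding: ANKARA is dropped as the baseline level.
x_train_demo = pd.get_dummies(train, drop_first=True)

# Encode the new data fully, then keep exactly the training columns.
# Indicators that never occur in the new data are filled with 0.
x_new_demo = pd.get_dummies(new).reindex(columns=x_train_demo.columns, fill_value=0)

print(x_train_demo.columns.tolist())
print(x_new_demo)
```

With `drop_first=True` applied to `new` directly, `Location_IZMIR` would be the only level and would be dropped, so both rows would look like the baseline city.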

--------------------

#### Churn predictions are made.

In [77]:
active_user_churn_predictions = model.predict(x_active_users)

---------------------------------

#### Forecasts for active users in the second period are added to the dataset.

In [78]:
x_active_users['Churn_Prediction'] = active_user_churn_predictions

---------------------------------

#### UserIDs are added back.

In [79]:
x_active_users['UserID'] = x_active_users_raw['UserID']

---------------------------------

#### Churn forecasts are shown.

In [80]:
k = x_active_users[['UserID', 'Churn_Prediction']]
k.head()

Unnamed: 0,UserID,Churn_Prediction
0,500546547,0
1,500546547,0
2,500338383,0
3,500338383,0
4,500338383,0
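
The table above has one row per transaction, so the same UserID appears several times. A user-level answer needs an aggregation rule. The sketch below shows one possible rule on made-up predictions; the column names `share_flagged`, `any_flagged` and `n_rows` are hypothetical.

```python
import pandas as pd

# Made-up row-level predictions: user 2 is flagged on one of three transactions.
row_predictions = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 2],
    "Churn_Prediction": [0, 0, 0, 1, 0],
})

user_level = (
    row_predictions.groupby("UserID")["Churn_Prediction"]
    .agg(share_flagged="mean", any_flagged="max", n_rows="size")
    .reset_index()
)
print(user_level)
```

Whether a user should count as a predicted churner when any row is flagged, or only when most rows are, is a design choice that depends on how costly a missed churner is compared with a false alarm.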


--------------------

--------------------

## An Alternative: Deep Learning with TensorFlow/Keras

In [47]:
model = Sequential()
model.add(Dense(1024, activation = "relu"))
model.add(Dense(512, activation = "relu"))
model.add(Dense(256, activation = "relu"))
model.add(Dense(128, activation = "relu"))
model.add(Dense(64, activation = "relu"))
model.add(Dense(32, activation = "relu"))
model.add(Dense(16, activation = "relu"))
model.add(Dense(8, activation = "relu"))
model.add(Dense(1, activation = "sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x,y, epochs=100, batch_size=128,verbose=1)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x1b1e2e83370>

In [48]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_9 (Dense)             (None, 1024)              30720     
                                                                 
 dense_10 (Dense)            (None, 512)               524800    
                                                                 
 dense_11 (Dense)            (None, 256)               131328    
                                                                 
 dense_12 (Dense)            (None, 128)               32896     
                                                                 
 dense_13 (Dense)            (None, 64)                8256      
                                                                 
 dense_14 (Dense)            (None, 32)                2080      
                                                                 
 dense_15 (Dense)            (None, 16)               

In [49]:
model.evaluate(x,y)



[0.057105787098407745, 0.9830290079116821]

## The deep learning model, used as an alternative, reached about 98% accuracy when evaluated on the same data it was trained on.

---------------------------------

#### Scaling the features

In [81]:
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
x_active_users_scaled = scaler.transform(x_active_users[x.columns])
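
Scaling helps only when the model is trained and queried on the same scale. One way to guarantee this is to put the scaler and the model in a single scikit-learn pipeline. The sketch below uses made-up data and a `LogisticRegression` as a lightweight stand-in for the Keras network, so it illustrates the pattern rather than reproducing this notebook's model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (think of price and age).
rng = np.random.default_rng(0)
x_train_demo = rng.normal(loc=[50.0, 30.0], scale=[20.0, 8.0], size=(200, 2))
y_train_demo = (x_train_demo[:, 0] > 50.0).astype(int)
x_new_demo = rng.normal(loc=[50.0, 30.0], scale=[20.0, 8.0], size=(5, 2))

# The pipeline fits the scaler on the training data only and applies the
# same transformation automatically at prediction time.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(x_train_demo, y_train_demo)

proba_new = pipe.predict_proba(x_new_demo)[:, 1]  # probability of class 1
labels_new = (proba_new > 0.5).astype(int)        # same 0.5 threshold as a sigmoid output
print(proba_new.round(3), labels_new)
```

With a Keras model the equivalent discipline is to fit the scaler on the training features, train the network on the scaled training features, and reuse that fitted scaler for every later prediction.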

---------------------------------

#### Churn predictions are made

In [82]:
active_user_churn_predictions_dl = model.predict(x_active_users_scaled)

-----------------------

#### Predictions are converted into binary classification results (0 or 1).

In [83]:
active_user_churn_predictions_dl_binary = (active_user_churn_predictions_dl > 0.5).astype(int)

---------------------------------

#### Forecasts for active users in the second period are added to the dataset.

In [84]:
x_active_users['Churn_Prediction_DL'] = active_user_churn_predictions_dl_binary

---------------------------------

####  Churn predictions are shown.

In [85]:
z = x_active_users[['UserID', 'Churn_Prediction_DL']]
z.head()

Unnamed: 0,UserID,Churn_Prediction_DL
0,500546547,0
1,500546547,0
2,500338383,0
3,500338383,0
4,500338383,0


---------------------------------

# CONCLUSIONS:

In this project, we performed a customer churn analysis using 18 months of transactional data from a company. The main objective of this analysis was to predict customer churn and thus optimise the company's customer relationship management strategies.

The dataset contained attributes such as transaction ID, user ID, date, product ID, channel, payment type, price, discount, the user's first transaction date, gender, location, age and product category. As a first step, we created the churn variable from the dataset: we determined whether users who were active in the first 6-month period churned in the following 3-month period.

After constructing the churn variable, we addressed the problem of unbalanced class distribution. To solve this problem, we used the SMOTE method, thus removing the imbalance between churn and non-churn instances in the training dataset.

Then, we split the dataset into two 9-month periods and used the churn model from the first period to predict whether active customers in the second period would churn. These predictions can help the company to develop strategies to prevent customer churn.

In the last step, we evaluated the performance of different classification algorithms on the SMOTE-balanced dataset, including Gaussian Naive Bayes, K-Nearest Neighbours, Support Vector Machines, Bernoulli Naive Bayes, Random Forest, Gradient Boosting, Logistic Regression, Decision Trees and XGBoost. By comparing the performance of these algorithms, the most appropriate model can be selected so that the company can predict churn more accurately and reliably.

As a result of this project, by analysing customer churn and addressing the problem of unbalanced dataset, we have helped the company to predict customer churn. This information will help the business to improve customer relationship management strategies and take specific measures to prevent customer churn.