# Dataset description

This dataset records ticket sales and customer behavior at a cinema hall, covering demographics, movie genre preferences, seat selection, ticket pricing, and customer retention. It is designed for analyzing customer engagement, spending behavior, and the factors that influence repeat visits to the cinema. The data is suited to predictive modeling and can support decisions on customer retention, marketing strategy, and cinema operations.

Columns Overview:

**Ticket_ID (Categorical):**

Description: A unique alphanumeric identifier for each ticket purchase. The ID consists of a random uppercase letter (A-Z) followed by a 4-digit number (e.g., B7539, Y1344).
Significance: This column identifies each individual transaction. It is a categorical identifier, useful for tracking specific purchases but not directly related to the other variables, so it is dropped before modeling.

**Age (Numerical):**

Description: The age of the customer who purchased the ticket, ranging between 18 and 60 years.
Significance: Age is an important demographic feature, providing insights into customer segments. For example, younger audiences might prefer different movie genres or seating types compared to older customers. Analyzing age data can help cinema halls cater to the needs of various age groups.

**Ticket_Price (Numerical):**

Description: The price the customer paid for the ticket, typically ranging from $10 to $25. The price varies based on factors like movie time, seat type, or cinema location.
Significance: Ticket price reflects customer spending and the cinema's pricing strategy. Understanding how ticket pricing impacts customer behavior can help optimize ticket sales and maximize revenue.

**Movie_Genre (Categorical):**

Description: The genre of the movie the customer attended, which can include one of the following: Action, Comedy, Horror, Drama, or Sci-Fi.
Significance: Genre preferences are crucial for understanding customer interests. Analyzing which genres are most popular can guide movie scheduling, marketing strategies, and even help in curating personalized recommendations for customers.

**Seat_Type (Ordinal):**

Description: The type of seat selected by the customer, with three ordinal categories:
Standard (Basic seating option)
Premium (Enhanced seating with added comfort)
VIP (Exclusive seating, offering premium features like extra legroom and priority service)
Significance: Seat type provides insights into customer spending behavior. Premium and VIP seat types typically correlate with higher ticket prices, and understanding seat preferences can help in optimizing cinema layout and pricing strategies. Additionally, this column can be used to gauge the popularity of high-end seating options.

**Number_of_Person (Mixed Variable):**

Description: The size of the customer's party, stored as text. This can be either:
Alone: The customer attended alone.
2–7: The customer attended as part of a group of 2 to 7 people.
Significance: Group size is an important factor in understanding customer preferences and behavior. For example, groups might purchase more tickets or opt for different movie genres and seat types than solo attendees. This column is crucial for analyzing social dynamics and group behavior in cinema attendance.

**Purchase_Again (Target - Binary):**

Description: A binary target variable indicating whether the customer is likely to return and purchase another ticket. It has two possible values:
Yes: The customer is likely to return for another movie.
No: The customer is not likely to return.
Significance: This is the key column for predictive modeling. It is used to assess customer retention and predict the likelihood of future ticket purchases. Analyzing the factors that influence repeat purchases (e.g., age, genre preferences, seat types) helps cinema halls optimize marketing and customer engagement strategies.
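
The column descriptions above translate directly into simple sanity checks. The sketch below is only an illustration and is not part of the original notebook: `validate_row` is a hypothetical helper, the allowed values are taken from the descriptions, and the sample record is copied from the first row of the `df.head(15)` output. Ticket_Price is left unchecked because its range is described only as typical.

```python
import re

# Allowed values, taken from the column descriptions above.
GENRES = {"Action", "Comedy", "Horror", "Drama", "Sci-Fi"}
SEAT_TYPES = {"Standard", "Premium", "VIP"}
GROUP_SIZES = {"Alone", "2", "3", "4", "5", "6", "7"}
TICKET_ID = re.compile(r"[A-Z]\d{4}")  # one uppercase letter followed by 4 digits


def validate_row(row):
    """Return a list of problems found in one record (hypothetical helper)."""
    problems = []
    if not TICKET_ID.fullmatch(row["Ticket_ID"]):
        problems.append("Ticket_ID format")
    if not 18 <= row["Age"] <= 60:
        problems.append("Age out of range")
    if row["Movie_Genre"] not in GENRES:
        problems.append("unknown Movie_Genre")
    if row["Seat_Type"] not in SEAT_TYPES:
        problems.append("unknown Seat_Type")
    if str(row["Number_of_Person"]) not in GROUP_SIZES:
        problems.append("unknown Number_of_Person")
    if row["Purchase_Again"] not in {"Yes", "No"}:
        problems.append("unknown Purchase_Again")
    return problems


good = {"Ticket_ID": "N4369", "Age": 55, "Ticket_Price": 12.27, "Movie_Genre": "Comedy",
        "Seat_Type": "Standard", "Number_of_Person": "7", "Purchase_Again": "No"}
bad = dict(good, Ticket_ID="4369N", Age=17)  # deliberately broken copy

print(validate_row(good))
print(validate_row(bad))
```

Running a check like this over every row before mapping would surface unexpected labels early, instead of leaving them unmapped later.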


# Task

1. Rename this notebook with your ID.
2. Load the given dataset.
3. Display dataset information and clean the data (print the first 15 rows, show all column names, check for nulls, handle null values, show all unique values for each non-numeric column).
4. Map the non-numeric columns using a dictionary. Note: your target column is "Purchase_Again".

5. Apply machine learning algorithms (SVM, DT and RF) and display their accuracy, precision, recall, and confusion matrix. Note: you MUST write your findings after each code block. This is where you discuss the results you obtained.

6. Make a comparison table with the results from the different ML algorithms.

7. Apply "grid search" to improve the results achieved for SVM, DT and RF. Hint: you need to find which parameters should be tuned and what their expected ranges are. Set the grid search parameters accordingly.

8. Suggest and use another parameter-tuning technique that is more suitable than grid search for any of these ML algorithms. Justify your choice.

9. Finally, report the best models that you created. Justify why each is the best. Carefully report the required performance measures. Hint: different performance measures may be appropriate for balanced vs. unbalanced datasets.

# Importing Libraries

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV


# Loading Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/cinema_hall_ticket_sales.csv')

In [None]:
df.head(15)

,Ticket_ID,Age,Ticket_Price,Movie_Genre,Seat_Type,Number_of_Person,Purchase_Again
0,N4369,55,12.27,Comedy,Standard,7,No
1,B8091,35,19.02,Drama,Standard,Alone,Yes
2,V6341,55,22.52,Horror,VIP,3,No
3,B3243,53,23.01,Drama,Standard,6,Yes
4,I3814,30,21.81,Comedy,VIP,4,Yes
5,E5655,28,11.58,Horror,VIP,Alone,Yes
6,P1526,50,22.91,Action,Standard,Alone,Yes
7,V4726,44,23.09,Sci-Fi,Premium,7,Yes
8,A2029,46,12.12,Sci-Fi,Standard,Alone,Yes
9,P0092,48,19.63,Action,VIP,Alone,Yes


# Checking for Null Values

In [None]:
df.isnull().sum()

Ticket_ID,0
Age,0
Ticket_Price,0
Movie_Genre,0
Seat_Type,0
Number_of_Person,0
Purchase_Again,0


# Checking if the Dataset is Imbalanced or not

In [None]:
# Check if the dataset is balanced or not
df['Purchase_Again'].value_counts()

Purchase_Again,count
No,733
Yes,707


In [None]:
df['Movie_Genre'].unique()

array(['Comedy', 'Drama', 'Horror', 'Action', 'Sci-Fi'], dtype=object)

In [None]:
df['Seat_Type'].unique()

array(['Standard', 'VIP', 'Premium'], dtype=object)

In [None]:
df['Number_of_Person'].unique()

array(['7', 'Alone', '3', '6', '4', '2', '5'], dtype=object)

# Mapping

In [None]:
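# Integer codes for each non-numeric column; the target is Purchase_Again.
mappings = {
    # Nominal feature: the codes are arbitrary labels and imply no ranking.
    "Movie_Genre": {'Comedy': 3, 'Drama': 2, 'Horror': 1, 'Action': 0, 'Sci-Fi': 4},
    # Note: these codes do not follow the ordinal order Standard < Premium < VIP.
    "Seat_Type": {'Standard': 3, 'VIP': 2, 'Premium': 1},
    # 'Alone' is encoded as a party of one.
    "Number_of_Person": {'Alone': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7},
    "Purchase_Again": {'Yes': 1, 'No': 0},
}

df.replace(mappings, inplace=True)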
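
Since Seat_Type is described as ordinal and Movie_Genre as nominal, an alternative encoding could treat them differently. The sketch below is an assumption, not the encoding this notebook uses: it works on a three-row toy frame built from the labels shown in the `unique()` outputs, and `seat_order` is a hypothetical name.

```python
import pandas as pd

# Toy rows using the category labels from the unique() outputs above.
toy = pd.DataFrame({
    "Movie_Genre": ["Comedy", "Drama", "Sci-Fi"],
    "Seat_Type": ["Standard", "VIP", "Premium"],
    "Number_of_Person": ["7", "Alone", "3"],
})

# Ordinal feature: codes follow the stated order Standard < Premium < VIP.
seat_order = {"Standard": 0, "Premium": 1, "VIP": 2}
toy["Seat_Type"] = toy["Seat_Type"].map(seat_order)

# Mixed feature: treat "Alone" as a party of one, then cast to int.
toy["Number_of_Person"] = toy["Number_of_Person"].replace("Alone", "1").astype(int)

# Nominal feature: one-hot columns avoid implying an order between genres.
toy = pd.get_dummies(toy, columns=["Movie_Genre"], dtype=int)

print(toy)
```

Switching to an encoding like this would change the feature matrix, so the scores reported in this notebook would no longer apply as-is.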



# Selecting Features

In [None]:
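# Features: keep Ticket_Price, Movie_Genre and Seat_Type; drop the ID, the target, Age and Number_of_Person.
X = df.drop(['Ticket_ID', 'Purchase_Again', 'Age', 'Number_of_Person'], axis=1)
y = df['Purchase_Again']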

# Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVM

In [None]:
# Polynomial-kernel SVM; precision and recall are support-weighted averages over both classes.
svm = SVC(kernel='poly', gamma=1, random_state=42, C=50)
svm.fit(X_train, y_train)
y_pred_svc = svm.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_svc)
precision = precision_score(y_test, y_pred_svc, average='weighted')
recall = recall_score(y_test, y_pred_svc, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")



Accuracy: 0.5555555555555556
Precision: 0.558709041635871
Recall: 0.5555555555555556


In [None]:
cm = confusion_matrix(y_test, y_pred_svc)
cm

array([[89, 52],
       [76, 71]])
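
The reported accuracy, precision and recall can be re-derived by hand from the confusion matrix above, which also shows why weighted recall always equals accuracy in this notebook. The following is a minimal sketch in plain Python; `weighted_metrics` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def weighted_metrics(cm):
    """Accuracy plus support-weighted precision and recall from a confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n_classes)) / total

    precision = recall = 0.0
    for i in range(n_classes):
        support = sum(cm[i])                                 # true members of class i
        predicted = sum(cm[r][i] for r in range(n_classes))  # predicted as class i
        class_precision = cm[i][i] / predicted if predicted else 0.0
        class_recall = cm[i][i] / support if support else 0.0
        precision += support / total * class_precision
        recall += support / total * class_recall
    return accuracy, precision, recall


svm_cm = [[89, 52], [76, 71]]  # copied from the output above
accuracy, precision, recall = weighted_metrics(svm_cm)
print(f"Accuracy:  {accuracy:.6f}")
print(f"Precision: {precision:.6f}")
print(f"Recall:    {recall:.6f}")
```

Each class's recall is weighted by its support, so the supports cancel and the sum reduces to correct predictions divided by total samples, which is the accuracy. Per-class precision and recall (for example via `classification_report`, which is already imported) carry more information than the weighted averages.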

# Decision Tree

In [None]:
# No random_state is set, so results can vary slightly between runs.
dt = DecisionTreeClassifier(criterion='log_loss', max_depth=15, min_samples_split=25)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_dt)
precision = precision_score(y_test, y_pred_dt, average='weighted')
recall = recall_score(y_test, y_pred_dt, average='weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")



Accuracy: 0.5069444444444444
Precision: 0.5103392530709172
Recall: 0.5069444444444444


In [None]:
cm = confusion_matrix(y_test, y_pred_dt)
cm

array([[89, 52],
       [90, 57]])

# Random Forest

In [None]:
rf = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_split=4, random_state=500)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_rf)
precision = precision_score(y_test, y_pred_rf, average = 'weighted')
recall = recall_score(y_test, y_pred_rf, average = 'weighted')
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")



Accuracy: 0.5104166666666666
Precision: 0.5116003787878788
Recall: 0.5104166666666666


In [None]:
cm = confusion_matrix(y_test, y_pred_rf)
cm

array([[78, 63],
       [78, 69]])

In [None]:
results = [
    ["SVM", accuracy_score(y_test, y_pred_svc), precision_score(y_test, y_pred_svc, average='weighted'), recall_score(y_test, y_pred_svc, average='weighted')],
    ["Decision Tree", accuracy_score(y_test, y_pred_dt), precision_score(y_test, y_pred_dt, average='weighted'), recall_score(y_test, y_pred_dt, average='weighted')],
    ["Random Forest", accuracy_score(y_test, y_pred_rf), precision_score(y_test, y_pred_rf, average='weighted'), recall_score(y_test, y_pred_rf, average='weighted')],
]

df_results = pd.DataFrame(results, columns=["Model", "Accuracy", "Precision", "Recall"])

print(df_results)

           Model  Accuracy  Precision    Recall
0            SVM  0.555556   0.558709  0.555556
1  Decision Tree  0.506944   0.510339  0.506944
2  Random Forest  0.510417   0.511600  0.510417
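
To put the table above in context, each accuracy can be compared with a majority-class baseline, that is, always predicting the more frequent class in the test set. The sketch below is a worked calculation using only the three confusion matrices printed earlier; the `baseline` comparison itself is an addition for illustration and is not part of the original notebook.

```python
# Confusion matrices copied from the outputs above (rows = true, cols = predicted).
confusion = {
    "SVM": [[89, 52], [76, 71]],
    "Decision Tree": [[89, 52], [90, 57]],
    "Random Forest": [[78, 63], [78, 69]],
}

# All models share one test set, so class support can be read from any matrix's row sums.
support = [sum(row) for row in confusion["SVM"]]
baseline = max(support) / sum(support)  # always predict the majority class

print(f"Test-set support: {support}, baseline accuracy: {baseline:.4f}")
for name, cm in confusion.items():
    accuracy = (cm[0][0] + cm[1][1]) / sum(support)
    print(f"{name}: accuracy={accuracy:.4f}, lift over baseline={accuracy - baseline:+.4f}")
```

The test set holds 141 "No" and 147 "Yes" samples, so the baseline is 147/288. The Random Forest gets exactly 147 predictions right and therefore matches the baseline, the Decision Tree falls slightly below it, and only the SVM improves on it, by about 4.5 percentage points.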


# Grid search

# Grid Search on SVM

In [None]:
param_grid = {
    'C': [5, 10, 50],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'gamma': ['scale', 'auto', 0.1, 0.5, 1],
}

# best_score_ is the mean 5-fold cross-validated accuracy on the training set, not test accuracy.
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)

Best Parameters: {'C': 50, 'gamma': 1, 'kernel': 'poly'}
Best Accuracy: 0.5468849990589121
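
The cost of a grid search grows with the product of the list lengths in the grid. The sketch below counts the work implied by the SVM grid above using plain Python; the names `combinations`, `total_fits` and `linear_settings` are illustrative only.

```python
from itertools import product

# Same grid as the SVM search above.
param_grid = {
    "C": [5, 10, 50],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "gamma": ["scale", "auto", 0.1, 0.5, 1],
}
cv_folds = 5

# Grid search tries the Cartesian product of all value lists.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combinations) * cv_folds

# gamma has no effect on the linear kernel, so these settings repeat the same model.
linear_settings = [c for c in combinations if c["kernel"] == "linear"]

print("Parameter combinations:", len(combinations))
print("Model fits with 5-fold CV:", total_fits)
print("Linear-kernel combinations (gamma ignored):", len(linear_settings))
```

This grid yields 60 combinations and 300 cross-validation fits, plus one final refit of the best configuration on the full training set. Passing a list of smaller dictionaries as `param_grid` (one per kernel) is one way to avoid the redundant linear-kernel settings.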


# Grid Search on Decision Tree

In [None]:
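param_grid = {
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [None, 35, 40, 50],
    'min_samples_split': [2, 3, 4],
    # random_state is a seed rather than a model hyperparameter; searching over it mostly measures seed-to-seed variance.
    'random_state': [200, 300, 600],
}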

grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)

Best Parameters: {'criterion': 'gini', 'max_depth': 35, 'min_samples_split': 2, 'random_state': 200}
Best Accuracy: 0.50866930171278


# Grid Search on Random Forest

In [None]:
param_grid = {
    'n_estimators': [200, 250, 500],
    'max_depth': [None, 5, 10, 35],
    'min_samples_split': [5, 6, 7],
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)

Best Parameters: {'max_depth': 5, 'min_samples_split': 7, 'n_estimators': 200}
Best Accuracy: 0.5243440617353661
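
Task 8 asks for a tuning technique other than grid search. One common candidate is randomized search, which samples a fixed number of configurations from parameter distributions, so the cost is set by `n_iter` rather than by the size of the grid. The sketch below is an illustration under stated assumptions, not a result from this notebook: synthetic data from `make_classification` stands in for the cinema dataset, and the sampling ranges are arbitrary examples.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption): 3 features, like the notebook's X.
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Distributions instead of fixed lists: each trial samples one value per key.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 12),
    "min_samples_split": randint(2, 10),
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,        # number of sampled configurations (the search budget)
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_train, y_train)

print("Best Parameters:", random_search.best_params_)
print("Best CV Accuracy:", random_search.best_score_)
print("Test Accuracy:", random_search.score(X_test, y_test))
```

With the real data, `X_train`, `y_train`, `X_test` and `y_test` from the split earlier in the notebook would replace the synthetic arrays. Reporting the held-out test score next to the cross-validated score, as in the last line, keeps the comparison with the untuned models fair.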
