<a href="https://colab.research.google.com/github/harshalDharpure/APR-Assignment-_1-LR/blob/main/Logistic_regression_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Analyze the "Movie_collection_train.csv" dataset using Logistic Regression.

## Load the dataset

### Subtask:
Load the `Movie_collection_train.csv` file into a pandas DataFrame.


**Reasoning**:
Import pandas and load the dataframe.



In [None]:
import pandas as pd

df = pd.read_csv('/content/Movie_collection_train.csv')
df.head()

## Explore the data

### Subtask:
Perform initial data exploration to understand the structure, content, and basic statistics of the dataset. This may include checking for missing values, examining data types, and looking at the distribution of key variables.


**Reasoning**:
Perform initial data exploration by displaying the shape, data types, descriptive statistics, missing values, and unique values of the DataFrame.



In [None]:
print("Shape of the DataFrame:")
print(df.shape)

print("\nData types of each column:")
df.info()

print("\nDescriptive statistics for numerical columns:")
display(df.describe())

print("\nMissing values per column:")
print(df.isnull().sum())

print("\nUnique values and counts for categorical columns:")
for col in df.select_dtypes(include='object').columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts())

## Preprocess the data

### Subtask:
Prepare the data for the logistic regression model. This will involve handling missing values, encoding categorical features, and potentially scaling numerical features.


**Reasoning**:
Address missing values, encode categorical features, create the target variable, and scale numerical features to prepare the data for logistic regression.



In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np

# 1. Impute missing values in `Time_taken` with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform returns a 2D array of shape (n, 1); flatten it before assigning to a column
df['Time_taken'] = imputer.fit_transform(df[['Time_taken']]).ravel()

# 2. Drop `MPAA_film_rating`: it has only one unique value, so it carries no signal
df = df.drop('MPAA_film_rating', axis=1)

# 3. One-hot encode 'Genre' (dtype=int keeps the dummies as 0/1 instead of booleans)
df = pd.get_dummies(df, columns=['Genre'], drop_first=True, dtype=int)

# 4. Encode the binary categorical column '3D_available' as 0/1
label_encoder = LabelEncoder()
df['3D_available'] = label_encoder.fit_transform(df['3D_available'])

# 5. Create the binary target 'Collection_Success' (1 if above the median collection)
median_collection = df['Collection'].median()
df['Collection_Success'] = (df['Collection'] > median_collection).astype(int)

# Drop the original 'Collection' column so it cannot leak into the features
df = df.drop('Collection', axis=1)

# 6. Scale numerical features
# Identify numerical columns, excluding the target variable
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
numerical_cols.remove('Collection_Success')
# Exclude the one-hot encoded genre columns and the encoded 3D_available from scaling
cols_to_exclude_from_scaling = ['3D_available'] + [col for col in numerical_cols if col.startswith('Genre_')]
numerical_cols_to_scale = [col for col in numerical_cols if col not in cols_to_exclude_from_scaling]

scaler = StandardScaler()
df[numerical_cols_to_scale] = scaler.fit_transform(df[numerical_cols_to_scale])

display(df.head())

## Split the data

### Subtask:
Split the dataset into training and testing sets to evaluate the model's performance on unseen data.


**Reasoning**:
Split the preprocessed data into training and testing sets for model evaluation.



In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('Collection_Success', axis=1)
y = df['Collection_Success']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
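
The preprocessing cell fits the imputer and the scaler on the full DataFrame before this split, so the test rows contribute to the imputed mean and the scaling statistics. One possible alternative, sketched here on synthetic data, is to wrap the steps in a scikit-learn `Pipeline` so they are fitted on the training rows only. The column values, the `toy` DataFrame, and the variable names below are assumptions for illustration and are not taken from `Movie_collection_train.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the movie data (hypothetical values, not the real CSV)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Budget": rng.normal(35000, 4000, n),
    "Time_taken": rng.normal(150, 30, n),
    "3D_available": rng.integers(0, 2, n),
})
# Inject a few missing values, mimicking the gaps in `Time_taken`
toy.loc[rng.choice(n, 8, replace=False), "Time_taken"] = np.nan
# Toy binary target: noisy budget above the median budget
toy_target = (toy["Budget"] + rng.normal(0, 2000, n) > toy["Budget"].median()).astype(int)

continuous = ["Budget", "Time_taken"]

# Impute and scale only the continuous columns; pass the 0/1 flag through unchanged
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([("num", numeric_steps, continuous)], remainder="passthrough")
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression())])

Xtr, Xte, ytr, yte = train_test_split(
    toy, toy_target, test_size=0.2, random_state=42, stratify=toy_target
)
clf.fit(Xtr, ytr)  # imputer mean and scaler statistics come from Xtr only
test_accuracy = clf.score(Xte, yte)
print(f"Toy test accuracy: {test_accuracy:.3f}")
```

With this layout, `clf.score(Xte, yte)` applies the training-set mean and standard deviation to the test rows, which is the usual way to keep the test set unseen.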

## Build and train the logistic regression model

### Subtask:
Initialize and train a Logistic Regression model on the training data.


**Reasoning**:
Initialize and train a Logistic Regression model on the training data.



In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
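
As a rough illustration of what the fitted model computes, the sketch below reproduces `predict_proba` for a binary `LogisticRegression` by hand: a linear score `w·x + b` passed through the sigmoid. The data comes from `make_classification` and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical; nothing here uses the movie dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Linear score for each row: z = w . x + b
z = X_demo @ demo_model.coef_[0] + demo_model.intercept_[0]

# The sigmoid maps the score to P(y = 1 | x)
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]

matches = np.allclose(manual_proba, sklearn_proba)
print("Manual sigmoid matches predict_proba:", matches)
```

The predicted class is 1 whenever this probability exceeds 0.5, which is equivalent to the linear score being positive.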

## Evaluate the model

### Subtask:
Evaluate the trained model's performance on the testing data using appropriate metrics (e.g., accuracy, precision, recall, F1-score).


**Reasoning**:
Use the trained model to make predictions on the test set and calculate the evaluation metrics.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate and print the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

## Interpret the results

### Subtask:
Analyze the model's coefficients to understand the relationships between the features and the target variable.


**Reasoning**:
Access and analyze the coefficients of the trained logistic regression model to understand feature importance and direction of influence.



In [None]:
import pandas as pd

# 1. Access the coefficients of the trained logistic regression model
coefficients = model.coef_[0]

# 2. Create a pandas Series associating coefficients with feature names
feature_names = X_train.columns
coef_series = pd.Series(coefficients, index=feature_names)

# 3. Sort the coefficients by their absolute values
sorted_coef = coef_series.abs().sort_values(ascending=False)
sorted_coef_with_sign = coef_series[sorted_coef.index]

# 4. Print the sorted coefficients
print("Sorted Coefficients (by absolute value):")
display(sorted_coef_with_sign)
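
A common way to read logistic regression coefficients is to exponentiate them into odds ratios: `exp(coef)` is the multiplicative change in the odds of the positive class for a one-unit increase in the feature (one standard deviation, if the feature was standardized). The sketch below shows this on synthetic data; the feature names and `toy_model` are assumptions for illustration, not the fitted movie model.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names
X_raw, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(4)]
X_std = pd.DataFrame(StandardScaler().fit_transform(X_raw), columns=names)

toy_model = LogisticRegression().fit(X_std, y_toy)

coefs = pd.Series(toy_model.coef_[0], index=names)
odds_ratios = np.exp(coefs)  # odds multiplier per one-unit increase in the feature

# Order rows by absolute coefficient while keeping the sign visible
order = coefs.abs().sort_values(ascending=False).index
summary = pd.DataFrame({"coefficient": coefs[order], "odds_ratio": odds_ratios[order]})
print(summary)
```

An odds ratio above 1 corresponds to a positive coefficient and below 1 to a negative one, so the table carries the same information as the signed coefficients in a form that is often easier to explain.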

## Summary:

### Data Analysis Key Findings

*   The dataset contains 400 rows and 19 columns, including numerical and categorical data types.
*   The `Time_taken` column had 8 missing values, which were imputed using the mean.
*   The `MPAA_film_rating` column was dropped as it contained only one unique value.
*   The `Genre` column was one-hot encoded, while the `3D_available` column was label encoded.
*   A binary target variable `Collection_Success` was created, indicating whether the movie's collection was above the median.
*   Numerical features were scaled using `StandardScaler`.
*   The dataset was split into training (320 samples) and testing (80 samples) sets.
*   The trained Logistic Regression model achieved the following performance metrics on the test set:
    *   Accuracy: 0.8500
    *   Precision: 0.8684
    *   Recall: 0.8250
    *   F1-score: 0.8462
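*   The features with the largest absolute coefficients include `Trailer_views`, `Budget`, `3D_available`, and `Movie_length`. Because the continuous features were standardized, their coefficient magnitudes are roughly comparable as a measure of influence on `Collection_Success`; `3D_available` is an unscaled 0/1 flag, so its coefficient is on a different scale. The sign of each coefficient gives the direction of the relationship (positive or negative) with the log-odds of a successful collection.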

### Insights or Next Steps

*   The model reaches 85% accuracy on the 80-sample test set. Because the target is split at the median, the classes are roughly balanced, so this is well above the roughly 50% expected from always predicting a single class.
*   Further investigation into the features with the highest coefficients could provide deeper insights into what drives movie collection success.
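
A single 80-sample test split gives a fairly noisy accuracy estimate. One way to check how stable such a figure is, sketched here under the assumption of synthetic data from `make_classification` rather than the movie dataset, is stratified k-fold cross-validation with the scaler inside the pipeline. The names `X_cv`, `y_cv`, and `pipe` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data of a size similar to the 400-row dataset (illustrative only)
X_cv, y_cv = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Scaling lives inside the pipeline, so each fold is scaled with its own training statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="accuracy")
print("Fold accuracies:", np.round(scores, 3))
print(f"Mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The spread across folds gives a sense of how much a single-split accuracy could move with a different `random_state`.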


# Task
Analyze the dataset "Movie_collection_train.csv" using logistic regression, including data visualization, linear discriminant analysis, performance metrics, and a confusion matrix.

## Visualize the data

### Subtask:
Create visualizations to explore relationships between features and the target variable, and to understand data distributions.


**Reasoning**:
Create visualizations to explore the relationships between features and the target variable and to understand the data distributions.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Plot the distribution of the binary target 'Collection_Success'
# (the original 'Collection' column was dropped during preprocessing)
plt.figure(figsize=(10, 6))
sns.histplot(df['Collection_Success'], discrete=True)  # discrete bins; a KDE is not meaningful for a 0/1 variable
plt.title('Distribution of Collection Success')
plt.xlabel('Collection Success (1: Above Median, 0: Below Median)')
plt.ylabel('Frequency')
plt.show()

# 2. Create a box plot of 'Collection_Success' by '3D_available'
# (with a 0/1 target, the boxes only show which quartiles fall at 0 or 1)
plt.figure(figsize=(10, 6))
sns.boxplot(x='3D_available', y='Collection_Success', data=df)
plt.title('Collection Success by 3D Availability')
plt.xlabel('3D Available (1: Yes, 0: No)')
plt.ylabel('Collection Success')
plt.xticks([0, 1], ['No', 'Yes'])
plt.show()


# 3. Create scatter plots for numerical features against 'Collection_Success'
numerical_features = ['Marketin_expense', 'Production_expense', 'Budget', 'Trailer_views']
for feature in numerical_features:
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=feature, y='Collection_Success', data=df)
    plt.title(f'Collection Success vs {feature}')
    plt.xlabel(feature)
    plt.ylabel('Collection Success')
    plt.show()

# 4. Create bar plots for categorical features ('Genre' was one-hot encoded, 'MPAA_film_rating' was dropped)
# We can visualize the distribution of the one-hot encoded Genre columns
genre_cols = [col for col in df.columns if col.startswith('Genre_')]
for col in genre_cols:
    plt.figure(figsize=(8, 5))
    df[col].value_counts().plot(kind='bar')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=0)
    plt.show()

# Visualize the distribution of '3D_available' (which is now numerical)
plt.figure(figsize=(8, 5))
df['3D_available'].value_counts().plot(kind='bar')
plt.title('Distribution of 3D Availability')
plt.xlabel('3D Available (1: Yes, 0: No)')
plt.ylabel('Count')
plt.xticks([0, 1], ['No', 'Yes'], rotation=0)
plt.show()
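
Because `Collection_Success` is binary, a box plot or scatter plot against it shows little beyond two horizontal bands. A group-wise success rate is often easier to read. The sketch below computes it with `groupby` and `pd.crosstab` on a tiny hand-made DataFrame; the eight rows are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Tiny hypothetical sample: four movies without 3D and four with 3D
toy = pd.DataFrame({
    "3D_available": [0, 0, 0, 0, 1, 1, 1, 1],
    "Collection_Success": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Mean of a 0/1 target within each group is the share of successes in that group
success_rate = toy.groupby("3D_available")["Collection_Success"].mean()

# Cross-tabulation of raw counts: rows are 3D availability, columns are the target
counts = pd.crosstab(toy["3D_available"], toy["Collection_Success"])

print(success_rate)
print(counts)
```

The same two calls can be pointed at the preprocessed `df` to compare success rates across `3D_available` or any of the `Genre_` dummy columns.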

## Perform linear discriminant analysis (LDA)

### Subtask:
Apply LDA for dimensionality reduction or classification.


**Reasoning**:
Apply LDA for dimensionality reduction and transform the training and testing data.



In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Instantiate LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()

# Fit LDA model to the training data and transform
X_train_lda = lda.fit_transform(X_train, y_train)

# Transform the testing data
X_test_lda = lda.transform(X_test)

print("Shape of X_train_lda:", X_train_lda.shape)
print("Shape of X_test_lda:", X_test_lda.shape)
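
LDA can produce at most `min(n_classes - 1, n_features)` components, so a two-class target always projects onto a single axis regardless of how many input features there are. The sketch below illustrates this on synthetic data; `X_toy`, `y_toy`, and `toy_lda` are hypothetical names and the data is generated, not loaded from the movie file.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data with six features (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

toy_lda = LinearDiscriminantAnalysis()
X_proj = toy_lda.fit_transform(X_toy, y_toy)

# Upper bound on the number of discriminant components
n_classes = len(np.unique(y_toy))
max_components = min(n_classes - 1, X_toy.shape[1])

print("Projected shape:", X_proj.shape)
print("Maximum number of components:", max_components)
```

This is why the shapes printed above have a single column: the downstream logistic regression sees one feature per movie.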

## Build and train the logistic regression model

### Subtask:
Initialize and train a Logistic Regression model on the LDA-transformed training data.


**Reasoning**:
Initialize and train a Logistic Regression model on the LDA-transformed training data.



In [None]:
from sklearn.linear_model import LogisticRegression

model_lda = LogisticRegression()
model_lda.fit(X_train_lda, y_train)

## Evaluate the model

### Subtask:
Evaluate the trained model's performance on the testing data using appropriate metrics (e.g., accuracy, precision, recall, F1-score), including a confusion matrix.


**Reasoning**:
Use the trained LDA-based logistic regression model to make predictions on the transformed test set and calculate the evaluation metrics and the confusion matrix.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Make predictions on the test set
y_pred_lda = model_lda.predict(X_test_lda)

# Calculate and print the evaluation metrics
accuracy = accuracy_score(y_test, y_pred_lda)
precision = precision_score(y_test, y_pred_lda)
recall = recall_score(y_test, y_pred_lda)
f1 = f1_score(y_test, y_pred_lda)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

# Calculate and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_lda)
print("\nConfusion Matrix:")
print(conf_matrix)

## Interpret the results

### Subtask:
Analyze the model's coefficient to understand the relationship between the LDA component and the target variable.


**Reasoning**:
Access and print the coefficient of the trained logistic regression model which was trained on LDA-transformed data.



In [None]:
# Access the coefficient of the trained logistic regression model
coefficient_lda = model_lda.coef_[0]

# Print the coefficient
print("Coefficient of the Logistic Regression model after LDA:")
print(coefficient_lda)

## Summarize observations and insights

### Subtask:
Summarize the key observations from the data exploration, visualization, model performance, and feature interpretation.


## Summary:

### Data Analysis Key Findings

*   The distribution of `Collection_Success` is roughly balanced between movies that collected above and below the median.
*   Visualizations suggest that while higher `Marketin_expense`, `Production_expense`, `Budget`, and `Trailer_views` are associated with a higher probability of collection success, there is significant overlap between the two classes.
*   Linear Discriminant Analysis (LDA) reduced the data to a single component, which is the maximum for a two-class problem (LDA yields at most `n_classes - 1` components).
*   The logistic regression model trained on the LDA-transformed data achieved an accuracy of 0.8625 on the test set.
*   The model demonstrated a precision of 0.8718, a recall of 0.8500, and an F1-score of 0.8608.
*   The confusion matrix shows 34 true positives and 35 true negatives, with 5 false positives and 6 false negatives.
*   The logistic regression model trained on the single LDA component has a coefficient of approximately 2.04, indicating a positive relationship between the LDA component (which represents a combination of original features optimized for class separation) and the likelihood of collection success.
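
The reported metrics can be cross-checked directly from the confusion-matrix counts listed above (34 true positives, 35 true negatives, 5 false positives, 6 false negatives). The short calculation below uses the standard definitions; the variable names are chosen for this check only.

```python
# Counts taken from the confusion matrix reported in the findings above
tp, tn, fp, fn = 34, 35, 5, 6

accuracy_check = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision_check = tp / (tp + fp)                   # share of predicted positives that are correct
recall_check = tp / (tp + fn)                      # share of actual positives that are found
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)

print(f"Accuracy:  {accuracy_check:.4f}")
print(f"Precision: {precision_check:.4f}")
print(f"Recall:    {recall_check:.4f}")
print(f"F1-score:  {f1_check:.4f}")
```

The four values agree with the accuracy (0.8625), precision (0.8718), recall (0.8500), and F1-score (0.8608) reported for the LDA-based model.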

### Insights or Next Steps

*   Investigate the specific combination of original features that the single LDA component represents to gain a deeper understanding of which factors contribute most significantly to collection success according to the model.
*   Explore alternative dimensionality reduction techniques or feature selection methods to see if they yield improved model performance or more interpretable results.
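
As a possible starting point for the first item, the weights that define the LDA axis can be read from the fitted estimator. The sketch below assumes scikit-learn's default `svd` solver, where `scalings_` holds the projection weights and `xbar_` the overall mean, and uses synthetic data with hypothetical feature names rather than the movie features.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data with hypothetical feature names
X_arr, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
names = [f"feature_{i}" for i in range(5)]
X_toy = pd.DataFrame(X_arr, columns=names)

toy_lda = LinearDiscriminantAnalysis(solver="svd").fit(X_toy, y_toy)

# One weight per original feature for the single discriminant axis
weights = pd.Series(toy_lda.scalings_[:, 0], index=names)
ranked = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(ranked)

# Sanity check: centering by xbar_ and applying the weights reproduces transform()
projected = (X_toy.to_numpy() - toy_lda.xbar_) @ toy_lda.scalings_[:, :1]
reproduces_transform = np.allclose(projected, toy_lda.transform(X_toy))
print("Manual projection matches transform():", reproduces_transform)
```

The weights are only comparable across features that share a scale, so on the movie data they are most meaningful for the standardized columns.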


# Movie Collection Analysis Report

## 1. Explore the dataset

**Code:**

In [None]:
print("Shape of the DataFrame:")
print(df.shape)

print("\nData types of each column:")
df.info()

print("\nDescriptive statistics for numerical columns:")
display(df.describe())

print("\nMissing values per column:")
print(df.isnull().sum())

print("\nUnique values and counts for categorical columns:")
for col in df.select_dtypes(include='object').columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts())

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

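# 1. Plot the distribution of the binary target 'Collection_Success'
# (the original 'Collection' column was dropped during preprocessing)
plt.figure(figsize=(10, 6))
sns.histplot(df['Collection_Success'], discrete=True)  # discrete bins; a KDE is not meaningful for a 0/1 variable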
plt.title('Distribution of Collection Success')
plt.xlabel('Collection Success (1: Above Median, 0: Below Median)')
plt.ylabel('Frequency')
plt.show()

# 2. Create a box plot of 'Collection_Success' by '3D_available'
# (with a 0/1 target, the boxes only show which quartiles fall at 0 or 1)
plt.figure(figsize=(10, 6))
sns.boxplot(x='3D_available', y='Collection_Success', data=df)
plt.title('Collection Success by 3D Availability')
plt.xlabel('3D Available (1: Yes, 0: No)')
plt.ylabel('Collection Success')
plt.xticks([0, 1], ['No', 'Yes'])
plt.show()


# 3. Create scatter plots for numerical features against 'Collection_Success'
numerical_features = ['Marketin_expense', 'Production_expense', 'Budget', 'Trailer_views']
for feature in numerical_features:
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=feature, y='Collection_Success', data=df)
    plt.title(f'Collection Success vs {feature}')
    plt.xlabel(feature)
    plt.ylabel('Collection Success')
    plt.show()

# 4. Create bar plots for categorical features ('Genre' was one-hot encoded, 'MPAA_film_rating' was dropped)
# We can visualize the distribution of the one-hot encoded Genre columns
genre_cols = [col for col in df.columns if col.startswith('Genre_')]
for col in genre_cols:
    plt.figure(figsize=(8, 5))
    df[col].value_counts().plot(kind='bar')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=0)
    plt.show()

# Visualize the distribution of '3D_available' (which is now numerical)
plt.figure(figsize=(8, 5))
df['3D_available'].value_counts().plot(kind='bar')
plt.title('Distribution of 3D Availability')
plt.xlabel('3D Available (1: Yes, 0: No)')
plt.ylabel('Count')
plt.xticks([0, 1], ['No', 'Yes'], rotation=0)
plt.show()

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np

# NOTE: `df` was already preprocessed in the earlier preprocessing cell. The steps
# commented out below raised errors when re-run, presumably because the columns
# they reference ('MPAA_film_rating', 'Genre', 'Collection') no longer exist.

# 1. Impute missing values in `Time_taken` with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df['Time_taken'] = imputer.fit_transform(df[['Time_taken']]).ravel()

# 2. Drop MPAA_film_rating as it has only one unique value (already done)
# df = df.drop('MPAA_film_rating', axis=1)

# 3. One-hot encode 'Genre' (already done)
# df = pd.get_dummies(df, columns=['Genre'], drop_first=True)

# 4. Encode binary categorical column '3D_available'
label_encoder = LabelEncoder()
df['3D_available'] = label_encoder.fit_transform(df['3D_available'])

# 5. Create a binary target variable 'Collection_Success' (already done)
# median_collection = df['Collection'].median()
# df['Collection_Success'] = (df['Collection'] > median_collection).astype(int)

# Drop the original 'Collection' column (already done)
# df = df.drop('Collection', axis=1)

# 6. Scale numerical features
# Identify numerical columns (excluding the target variable and encoded/handled categoricals)
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
numerical_cols.remove('Collection_Success')
# Exclude one-hot encoded genre columns and the encoded 3D_available from scaling
cols_to_exclude_from_scaling = ['3D_available'] + [col for col in numerical_cols if col.startswith('Genre_')]
numerical_cols_to_scale = [col for col in numerical_cols if col not in cols_to_exclude_from_scaling]

scaler = StandardScaler()
df[numerical_cols_to_scale] = scaler.fit_transform(df[numerical_cols_to_scale])

display(df.head())

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('Collection_Success', axis=1)
y = df['Collection_Success']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Instantiate LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()

# Fit LDA model to the training data and transform
X_train_lda = lda.fit_transform(X_train, y_train)

# Transform the testing data
X_test_lda = lda.transform(X_test)

print("Shape of X_train_lda:", X_train_lda.shape)
print("Shape of X_test_lda:", X_test_lda.shape)

In [None]:
from sklearn.linear_model import LogisticRegression

model_lda = LogisticRegression()
model_lda.fit(X_train_lda, y_train)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Make predictions on the test set
y_pred_lda = model_lda.predict(X_test_lda)

# Calculate and print the evaluation metrics
accuracy = accuracy_score(y_test, y_pred_lda)
precision = precision_score(y_test, y_pred_lda)
recall = recall_score(y_test, y_pred_lda)
f1 = f1_score(y_test, y_pred_lda)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

# Calculate and print the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_lda)
print("\nConfusion Matrix:")
print(conf_matrix)

In [None]:
# Access the coefficient of the trained logistic regression model
coefficient_lda = model_lda.coef_[0]

# Print the coefficient
print("Coefficient of the Logistic Regression model after LDA:")
print(coefficient_lda)
