**IMDb Score Prediction**

**Problem Statement**

The primary goal of this project is to develop a linear regression model that predicts IMDb scores for movies available on the Films platform. Accurate IMDb score predictions will enable users to discover highly rated films that align with their preferences.

**Design Thinking Process**

* **Data Source**

The project utilized a dataset containing movie information, including genre, premiere date, runtime, language, and IMDb scores.

* **Phases of Development**

1. **Data Preprocessing:** Cleaning data, handling missing values, and converting categorical features into numerical representations.

2. **Feature Engineering:** Extracting relevant features and transforming the data to improve predictive power.

3. **Model Selection:** Choosing linear regression as the primary algorithm for IMDb score prediction.

4. **Model Training:** Training the linear regression model using preprocessed data.

5. **Evaluation:** Assessing model performance using regression evaluation metrics.
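
The phases above can be chained into a single scikit-learn pipeline. The following is a minimal sketch, not the notebook's actual implementation: the `toy` DataFrame and its values are made up for illustration, and the column names are assumed to match the dataset description.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real dataset; the values are invented.
toy = pd.DataFrame({
    "Genre": ["Drama", "Comedy", "Drama", "Thriller",
              "Comedy", "Drama", "Thriller", "Comedy"],
    "Language": ["English", "English", "Spanish", "English",
                 "Hindi", "Spanish", "English", "Hindi"],
    "Runtime": [95, 102, 88, 120, 97, 110, 105, 90],
    "IMDb Score": [6.5, 5.8, 7.1, 6.2, 5.5, 7.0, 6.0, 5.9],
})

X = toy.drop(columns="IMDb Score")
y = toy["IMDb Score"]

# Preprocessing: one-hot encode the categorical columns, pass Runtime through.
preprocess = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Genre", "Language"]),
    ],
    remainder="passthrough",
)

# Model selection: linear regression as the final pipeline step.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# Training and evaluation on a held-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
pipeline.fit(X_train, y_train)
toy_predictions = pipeline.predict(X_test)
toy_mae = mean_absolute_error(y_test, toy_predictions)
print(f"Toy MAE: {toy_mae:.3f}")
```

Wrapping the encoder and the model in one pipeline means the encoding is fitted on the training split only, which avoids leaking information from the test rows.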


**Dataset Description**

The dataset includes the following columns:

* Title: Movie title.
* Genre: Movie genre.
* Premiere: Movie premiere date.
* Runtime: Movie runtime in minutes.
* Language: Movie language.
* IMDb Score: The target variable, IMDb score.

In [None]:
import pandas as pd
dataset = pd.read_csv('NetflixOriginals.csv')  # named `dataset` to match the cells below

**Data Preprocessing**

Data preprocessing involved the following steps:

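1. **Handling Missing Values:** Imputed missing numerical values with the column mean, then removed any rows that still contained missing values. The imputation has to run first, because dropping all incomplete rows beforehand would leave nothing to impute.

In [None]:
# Impute missing numerical values with the column mean
# ('Runtime' is assumed to be the numerical feature; adjust to the actual column)
dataset['Runtime'] = dataset['Runtime'].fillna(dataset['Runtime'].mean())

# Remove rows that still have missing values in other columns
dataset = dataset.dropna()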


2. **Categorical Feature Encoding:** Encoded categorical variables using one-hot encoding.

In [None]:
# One-hot encoding for categorical variables
# (column names assumed to match the dataset description: 'Genre', 'Language')
dataset = pd.get_dummies(dataset, columns=['Genre', 'Language'])

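To show what one-hot encoding produces, here is a small sketch on a made-up three-row frame (the `toy` data is hypothetical). It also shows the `drop_first=True` option of `pd.get_dummies`, which the original cell does not use: dropping one level per category removes the perfect collinearity between the dummy columns and the intercept of a linear model.

```python
import pandas as pd

# Hypothetical three-row sample with the columns from the dataset description.
toy = pd.DataFrame({
    "Genre": ["Drama", "Comedy", "Drama"],
    "Language": ["English", "Spanish", "English"],
    "Runtime": [95, 102, 88],
})

# One indicator column per category value.
encoded = pd.get_dummies(toy, columns=["Genre", "Language"])
print(encoded.columns.tolist())

# Same encoding, but the first level of each category is dropped.
encoded_reduced = pd.get_dummies(toy, columns=["Genre", "Language"], drop_first=True)
print(encoded_reduced.columns.tolist())
```

Whether to drop a level is a design choice: predictions stay the same with ordinary least squares, but the individual coefficients are easier to interpret when each one is measured against a single reference category.
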
3. **Feature Engineering:** Extracted the release year from the premiere date.

In [None]:
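# Parse the premiere date ('Premiere' in the dataset description) so that .dt is available
dataset['Premiere'] = pd.to_datetime(dataset['Premiere'])
# Extract the release year from the premiere date
dataset['release_year'] = dataset['Premiere'].dt.year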

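The training and evaluation cells use `X_train`, `y_train`, `X_test`, and `y_test`, which the notebook does not define. A minimal sketch of how they could be produced with `train_test_split` follows. The toy `dataset`, its values, the 80/20 split, and the `random_state` are all assumptions for illustration; on the real data, non-numeric columns such as the title and the raw premiere date would also need to be dropped from the features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the preprocessed dataset (all-numeric features).
dataset = pd.DataFrame({
    "Runtime": [95, 102, 88, 120, 97, 110, 105, 90, 99, 115],
    "release_year": [2018, 2019, 2020, 2019, 2021, 2020, 2018, 2021, 2019, 2020],
    "Genre_Drama": [1, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "IMDb Score": [6.5, 5.8, 7.1, 6.2, 5.5, 7.0, 6.0, 5.9, 6.8, 6.1],
})

target_column = "IMDb Score"  # target name as given in the dataset description

# Separate the features from the target variable.
X = dataset.drop(columns=[target_column])
y = dataset[target_column]

# Hold out 20% of the rows for evaluation; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```
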
**Model Training**

Model training involved the following steps:

1. **Regression Algorithm Choice:** Linear Regression was selected as the primary algorithm due to its simplicity and interpretability.

In [None]:
from sklearn.linear_model import LinearRegression

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the preprocessed data
# (X_train and y_train are assumed to come from a train/test split of `dataset`)
model.fit(X_train, y_train)
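
Interpretability here means the fitted coefficients can be read directly: each one is the change in predicted score per unit change in that feature, holding the others fixed. The sketch below illustrates this on synthetic data. The feature names and the generating formula are hypothetical, chosen so the recovered coefficients are known in advance.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features; 'Genre_Documentary' mimics a one-hot encoded column.
X_train = pd.DataFrame({
    "Runtime": [90, 100, 110, 120, 95, 105],
    "Genre_Documentary": [0, 1, 0, 1, 1, 0],
})

# Synthetic target built from a known linear rule (for illustration only).
y_train = 2.0 + 0.03 * X_train["Runtime"] + 0.5 * X_train["Genre_Documentary"]

model = LinearRegression()
model.fit(X_train, y_train)

# Pair each coefficient with its feature name, largest magnitude first.
coefficients = pd.Series(model.coef_, index=X_train.columns)
coefficients = coefficients.sort_values(key=lambda s: s.abs(), ascending=False)
print(coefficients)
print(f"Intercept: {model.intercept_:.3f}")
```

On real data the coefficients will not match a clean rule like this, and features on different scales are not directly comparable by magnitude unless they are standardized first.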


**Choice of Regression Algorithm and Evaluation Metrics**

1. **Regression Algorithm:** Linear Regression was chosen as the primary algorithm because of its simplicity and interpretability. Its main limitation is that it assumes a linear relationship between the features and the IMDb score, but it serves as a strong baseline for IMDb score prediction.

2. **Evaluation Metrics:** For model assessment, we used the following regression metrics:

* **Mean Absolute Error (MAE):** To measure the average absolute error between predicted and actual IMDb scores.
* **Mean Squared Error (MSE):** To measure the average squared error between predicted and actual IMDb scores.

The choice of linear regression and these evaluation metrics provides a foundation for future model development and experimentation.
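
To make the two metrics concrete, the sketch below computes them by hand with NumPy on four made-up score pairs and checks the results against scikit-learn. The arrays are hypothetical and are not results from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted IMDb scores.
y_true = np.array([6.5, 7.0, 5.5, 8.0])
y_hat = np.array([6.0, 7.5, 5.5, 7.0])

errors = y_true - y_hat  # [0.5, -0.5, 0.0, 1.0]

# MAE: mean of the absolute errors -> (0.5 + 0.5 + 0.0 + 1.0) / 4 = 0.5
mae_manual = np.mean(np.abs(errors))

# MSE: mean of the squared errors -> (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
mse_manual = np.mean(errors ** 2)

# RMSE: square root of MSE, expressed in the same units as the score.
rmse_manual = np.sqrt(mse_manual)

print(f"MAE: {mae_manual}, MSE: {mse_manual}, RMSE: {rmse_manual:.4f}")

# The manual values agree with scikit-learn's implementations.
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse_manual, mean_squared_error(y_true, y_hat))
```

Because MSE squares each error, the single 1.0-point miss contributes most of its value, whereas MAE weights all errors linearly. Reporting both shows whether a few large misses dominate.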

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate MAE and MSE
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")

**Conclusion**

In conclusion, this project establishes a foundation for IMDb score prediction using a linear regression model. While linear regression serves as a strong starting point, there is ample room for further model exploration and optimization. MAE and MSE provide a consistent basis for assessing the model and for comparing it against future alternatives. We anticipate future enhancements to this project, including the exploration of advanced regression techniques.