Skip to content

mongodb-developer/Scikit-learn-Examples

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 

Repository files navigation

#Retrieve the Movie Collection Data: Connect to your MongoDB Atlas sample data movie collection and retrieve the relevant dataset. This could include attributes like genre, #director, actors, release year, and user ratings.

Import the necessary libraries

from pymongo import MongoClient

Connect to MongoDB Atlas

client = MongoClient('<your_connection_string>')

Access the movie collection

db = client['sample_mflix']
movie_collection = db['movies']

Retrieve the relevant dataset

movie_data = movie_collection.find({}, {
    'genres': 1,
    'directors': 1,
    'cast': 1,
    'released': 1,
    'tomatoes': 1
})

#Replace <your_connection_string> with the actual connection string for your MongoDB Atlas cluster.

#Preprocess the Data: Before training the classification model, preprocess the data to ensure it is in a suitable format. This may involve handling missing values, encoding #categorical variables, and normalizing numeric features using techniques like one-hot encoding or feature scaling.

Preprocess the data (example: handling missing values)

for movie in movie_data:
    if 'tomatoes' not in movie:
        movie['tomatoes'] = 0.0

Modify the preprocessing steps based on the specific requirements of your dataset.

Split the Data: Split the dataset into training and testing subsets. The training set will be used to train the classification model, while the testing set will be used to evaluate its performance.

from sklearn.model_selection import train_test_split

Split the data into training and testing sets

X = [movie['genre'] + movie['director'] + movie['actors'] + [movie['release_year']] for movie in movie_data]
y = [movie['tomatoes'] for movie in movie_data]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Adjust the train-test split ratio (test_size) and random state (random_state) according to your needs.

Choose a Classification Algorithm: Select a suitable classification algorithm from scikit-learn, such as decision trees, random forests, or support vector machines. Each algorithm has its own strengths and considerations, so choose one that aligns with your specific classification task.

from sklearn.tree import DecisionTreeClassifier

Create a decision tree classifier

classifier = DecisionTreeClassifier()
You can replace DecisionTreeClassifier with the desired classification algorithm from scikit-learn.

#Train the Model: Fit the chosen classification algorithm to the training data. The model will learn the patterns and relationships between the movie features and their corresponding classes.

Train the model

classifier.fit(X_train, y_train)

Evaluate the Model: Use the testing set to evaluate the trained model's performance. Calculate metrics such as accuracy, precision, recall, and F1-score to assess how well the model predicts the movie classes.

from sklearn.metrics import classification_report

Make predictions on the testing set

y_pred = classifier.predict(X_test)

Evaluate the model's performance

report = classification_report(y_test, y_pred)
print(report)

Fine-tune the Model: Experiment with different hyperparameters of the classification algorithm to optimize the model's performance.

SciKit Learn Documentation

About

Demoing Scikit Learn on Movies DB in MongoDB Atlas

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages