## Movie Rating Prediction with Text Data

### Background

In this project, we used data from IMDB to classify users' movie ratings. We used text analysis to parse users' reviews of one hundred movies in order to generate predictions of each user's rating. A number of predictive models were employed to classify user ratings, including decision trees and random forests.

### Data Acquisition

Two separate APIs were used to collect IMDB movie data. First, the IMDB API was accessed via the Python IMDB client `imdbpie`. This API provided information on the top one hundred ranked movies in the IMDB database. The OMDB API was then used to collect runtime and genre data for each movie. Data from the two APIs was joined into a single dataframe and pushed to a local PostgreSQL server.
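
A minimal sketch of the join step, assuming the results from each API have already been loaded into pandas dataframes. The column names (`movie_id`, `title`, `rating`, `runtime`, `genre`) and the placeholder rows are hypothetical, and the push to PostgreSQL is left as a comment because the connection details are not given here.

```python
import pandas as pd

# Hypothetical stand-ins for the records returned by each API.
imdb_df = pd.DataFrame({
    "movie_id": ["tt0000001", "tt0000002", "tt0000003"],
    "title": ["Movie A", "Movie B", "Movie C"],
    "rating": [9.2, 9.0, 8.9],
})
omdb_df = pd.DataFrame({
    "movie_id": ["tt0000001", "tt0000002", "tt0000003"],
    "runtime": [142, 175, 152],
    "genre": ["Drama", "Crime", "Action"],
})

# Join on the shared movie ID; validate raises if either side has duplicate IDs.
movies = imdb_df.merge(omdb_df, on="movie_id", how="inner", validate="one_to_one")

# Pushing to PostgreSQL might look roughly like this (the URL is a placeholder):
# from sqlalchemy import create_engine
# engine = create_engine("postgresql://user:password@localhost:5432/dbname")
# movies.to_sql("movies", engine, if_exists="replace", index=False)
```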

### Text Data Preparation

Scikit-learn's `TfidfVectorizer` was used to parse the text of each user's movie review. The scikit-learn documentation describes `TfidfVectorizer` as equivalent to `CountVectorizer` followed by `TfidfTransformer`; the latter transforms raw counts into weights signifying the importance of each token (word) in the corpus of documents.
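
The equivalence can be checked directly. The sketch below uses three made-up review snippets and default settings for every class; it is only an illustration, not the configuration used in this project.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up review snippets used only for illustration.
docs = [
    "a great movie with a great cast",
    "a dull movie",
    "great cast but a dull plot",
]

# One step: tokenize, count, and weight.
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: raw token counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# True when both routes produce the same weight matrix.
same = np.allclose(one_step.toarray(), two_step.toarray())
```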

The resulting output was a dataframe with 200 columns, each representing one of the top 200 n-grams in the corpus. In addition to the n-gram columns, the movie ID was retained to enable SQL joins, and each review's rating was retained as the target for the learning algorithms.

### Preprocessing

The text data had to be stripped of non-alphanumeric characters prior to performing the analysis. Text data was also converted to lowercase. Additionally, the data was standardized using scikit-learn's `StandardScaler`.
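
A rough sketch of both steps. The `clean_text` helper and the sample inputs are hypothetical, and because it is not stated which columns were standardized, `StandardScaler` is applied here to a small numeric matrix only to show its effect.

```python
import re

import numpy as np
from sklearn.preprocessing import StandardScaler


def clean_text(text):
    """Replace non-alphanumeric characters with spaces, then lowercase."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip().lower()


cleaned = clean_text("Great movie!!! 10/10, would watch again.")

# StandardScaler rescales each column to zero mean and unit variance.
X = np.array([[0.0, 0.2], [0.5, 0.0], [1.0, 0.4]])
X_scaled = StandardScaler().fit_transform(X)
```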

It's also important to note that I binned my target into three categories:
- low score: 1-3
- medium score: 4-7
- high score: 8-10
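
The binning above can be sketched with `pandas.cut`, assuming integer ratings from 1 to 10. The sample ratings and the label names are illustrative.

```python
import pandas as pd

ratings = pd.Series([1, 3, 4, 7, 8, 10])

# Bins are right-inclusive: (0, 3], (3, 7], (7, 10].
binned = pd.cut(ratings, bins=[0, 3, 7, 10], labels=["low", "medium", "high"])
```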

### Model Building

Both decision trees and random forest classifiers were used to predict movie ratings in this analysis. First, a decision tree classifier was built and run using ten-fold cross-validation. The resulting mean accuracy score was __.59__.
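
A minimal sketch of ten-fold cross-validation with a decision tree. The data is synthetic, generated with `make_classification` as a stand-in for the TF-IDF features and the three rating classes, so the scores it produces are unrelated to the result reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the review features and the three binned rating classes.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# cv=10 trains on nine folds and scores on the tenth, rotating through all folds.
scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()
```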

Next, I performed a grid search to identify the optimal parameters and compare the mean accuracy score with that of the original model. The grid search identified a model with the following parameters:

    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
            max_features=50, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

The optimal model performed substantially better than the original tree, producing an accuracy score of __.83__.
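
A sketch of a grid search over decision tree parameters with `GridSearchCV`. The parameter grid is hypothetical, since the values actually searched are not listed here, and the data is again synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the review features and rating classes.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical grid, chosen only for illustration.
param_grid = {
    "max_depth": [2, 4, 8, None],
    "max_features": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=10,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy of the best setting
```

`search.best_estimator_` holds the tree refit on all of the data with the winning parameters.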

Finally, I repeated the process with the random forest ensembling method. The grid search also produced a score of __.83__. Unfortunately, due to the computational expense of running these models and issues with my notebook's kernel, I ended the analysis at this point.
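
The same pattern applies to a random forest. The sketch below assumes a deliberately small, hypothetical grid to keep the run time down, and uses synthetic data; `n_jobs=-1` spreads the fits across available cores, which may help with the computational cost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the review features and rating classes.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical grid, kept small because each setting trains many trees per fold.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, None],
}

forest_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,
    scoring="accuracy",
    n_jobs=-1,
)
forest_search.fit(X, y)

forest_best_params = forest_search.best_params_
forest_best_score = forest_search.best_score_
```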

### Final Takeaway

`GridSearchCV` drastically improved the performance of my decision tree model, raising the mean accuracy score from .59 to .83. However, it should be noted that the grid search may have caused my models to overfit.
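
One way to probe for this kind of overfitting, which was not part of this analysis, is to hold out a test set that the grid search never sees and compare its accuracy with the cross-validated score. The sketch below assumes synthetic data and a hypothetical one-parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the review features and rating classes.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Set aside 25% of the rows before any tuning takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 8, None]},  # hypothetical grid
    cv=10,
    scoring="accuracy",
)
search.fit(X_train, y_train)

cv_score = search.best_score_
test_score = search.score(X_test, y_test)  # accuracy on rows the search never saw

# A cross-validated score well above the held-out score suggests overfitting.
gap = cv_score - test_score
```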