<h1>Kaggle Competition</h1>

<b>Important Note</b>: Whenever a cell modifies variables created by an earlier cell, work on a <u>copy</u> of those variables instead of changing them in place. That way, if a step has to be redone midway, you can rerun just that cell instead of the whole notebook from the beginning.
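
A minimal sketch of why the copy matters, using a tiny made-up DataFrame (the variable names and values are illustrative only, not part of the competition data):

```python
import pandas as pd

# Toy frame standing in for a variable produced by an earlier cell.
original = pd.DataFrame({"Rating": [3, 4, 5]})

# Work on an explicit copy so the earlier variable stays untouched.
working = original.copy()
working["Rating"] = working["Rating"] * 2

print(original["Rating"].tolist())  # the earlier cell's data is unchanged
print(working["Rating"].tolist())   # only the copy was modified
```

Because `original` is never altered, the cell can be rerun any number of times and always starts from the same state.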

<h3>All Imports</h3>

In [3]:
import pandas as pd
import numpy as np
import re

<h2>Input</h2>

In [4]:
#movies.txt
movies = pd.read_csv("./movies.txt", sep="::", names=["MovieID", "Title", "Genre"], engine = "python")
print(movies.shape)
movies.head()

(3883, 3)


Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
#Lookup tables reserved for preprocessing: they map coded values to readable labels.
dict_age = {"1": "Under 18",
            "18": "18-24",
            "25": "25-34",
            "35": "35-44",
            "45": "45-49",
            "50": "50-55",
            "56": "56+"
           }
dict_gender = {"M": "Male",
               "F": "Female"
              }
dict_occupation = {
    "0": "other or not specified",
    "1": "academic/educator",
    "2": "artist",
    "3": "clerical/admin",
    "4": "college/grad student",
    "5": "customer service",
    "6": "doctor/health care",
    "7": "executive/managerial",
    "8": "farmer",
    "9": "homemaker",
    "10": "K-12 student",
    "11": "lawyer",
    "12": "programmer",
    "13": "retired",
    "14": "sales/marketing",
    "15": "scientist",
    "16": "self-employed",
    "17": "technician/engineer",
    "18": "tradesman/craftsman",
    "19": "unemployed",
    "20": "writer"
}
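
One way these lookup tables could be applied is with `Series.map`. The sketch below uses a made-up three-row stand-in for `users.txt` and only a subset of the dictionary entries; it also assumes that `Age` and `Occupation` are parsed as integers, which is why they are cast to strings before the lookup:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of users.txt.
users_demo = pd.DataFrame({
    "UserID": [1, 2, 3],
    "Gender": ["F", "M", "M"],
    "Age": [1, 56, 25],
    "Occupation": [10, 16, 15],
})

# Subsets of the lookup tables defined above (string keys, as in the notebook).
dict_age = {"1": "Under 18", "25": "25-34", "56": "56+"}
dict_gender = {"M": "Male", "F": "Female"}
dict_occupation = {"10": "K-12 student", "15": "scientist", "16": "self-employed"}

labelled = users_demo.copy()
# Assumption: the numeric codes were read as integers, so cast them to str
# before looking them up in dictionaries keyed by strings.
labelled["Age"] = labelled["Age"].astype(str).map(dict_age)
labelled["Gender"] = labelled["Gender"].map(dict_gender)
labelled["Occupation"] = labelled["Occupation"].astype(str).map(dict_occupation)
print(labelled)
```

Codes that have no entry in a dictionary come back as `NaN`, which can serve as a quick check for unexpected values.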

In [None]:
#users.txt
users = pd.read_csv("./users.txt", sep="::", names=["UserID", "Gender", "Age", "Occupation", "Zip-code"], engine = "python")
print(users.shape)
users.head()

In [None]:
#training data
train = pd.read_csv("./training.txt", names=["UserID", "MovieID", "Rating", "Timestamp"]) #sep = ','
print(train.shape)
train.head()

In [None]:
#testing data
test = pd.read_csv("./testing.txt", names=["UserID", "MovieID", "Timestamp"]) #sep = ','
print(test.shape)
test.head()

<h2>Data Exploration and Manipulation</h2>

In this section, we explore the data, run some analysis (meaningful or not), and make light changes to it. Preprocessing comes later: there we restructure the data and/or alter its rows so that it is suitable as input for our prediction algorithms. We use dummytrain variables so as not to change the original dataframe unless necessary.

<h3>Change Timestamp to Meaningful Format</h3>

In [None]:
dummytrain1 = train.copy()
dummytrain1['Timestamp'] = pd.to_datetime(train['Timestamp'], unit='s')
dummytrain1.head()

In [None]:
dummytrain2 = dummytrain1.copy()
#Extract the date parts with the .dt accessor of the datetime column
dummytrain2['year'] = dummytrain2['Timestamp'].dt.year
dummytrain2['month'] = dummytrain2['Timestamp'].dt.month
dummytrain2['day'] = dummytrain2['Timestamp'].dt.day
dummytrain2['hour'] = dummytrain2['Timestamp'].dt.hour
dummytrain2 = dummytrain2.drop(columns = ['Timestamp'])
dummytrain2.head()

<h3>Separating Genres, then one-hot encoding</h3>

In [8]:
movievariable = movies.copy()
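#After turning the column 'Genre' into a list of genres (e.g. by splitting on '|'), uncomment the code below
#Note: .sum(level=0) was removed in pandas 2.0, so .groupby(level=0).sum() is used instead
'''
movies_genre = pd.concat([movievariable, pd.get_dummies(movievariable['Genre'].apply(pd.Series).stack()).groupby(level=0).sum()], axis=1)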
movies_genre.drop('Genre',axis=1,inplace=True)
movies_genre.head()
'''
print()
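
An alternative that skips the intermediate list step is `Series.str.get_dummies`, which splits on a separator and one-hot encodes in one call. This is a sketch on three rows copied from the `movies.head()` output above, not necessarily the approach the notebook will end up using; the `_demo` names are hypothetical:

```python
import pandas as pd

# Three rows taken from the movies.head() output shown earlier.
movies_demo = pd.DataFrame({
    "MovieID": [1, 3, 5],
    "Title": [
        "Toy Story (1995)",
        "Grumpier Old Men (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "Genre": ["Animation|Children's|Comedy", "Comedy|Romance", "Comedy"],
})

# Split each pipe-delimited string and create one indicator column per genre.
genre_dummies = movies_demo["Genre"].str.get_dummies(sep="|")

# Replace the original 'Genre' column with the indicator columns.
movies_genre_demo = pd.concat(
    [movies_demo.drop(columns="Genre"), genre_dummies], axis=1
)
print(movies_genre_demo)
```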




<h3>List of movies unrated/unwatched</h3>

In [13]:
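#Note: this compares against the MovieIDs present in movies.txt, so it lists the IDs in 1..3882
#that are missing from the movie list, not movies that nobody rated.
#The upper bound 3883 is the row count of movies.txt; MovieIDs above 3882 are not checked.
result_set = set(map(int, movies['MovieID'].unique().tolist()))
movies_set = set(range(1, 3883))
print(sorted(movies_set - result_set))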

[91, 221, 323, 622, 646, 677, 686, 689, 740, 817, 883, 995, 1048, 1072, 1074, 1182, 1195, 1229, 1239, 1338, 1402, 1403, 1418, 1435, 1451, 1452, 1469, 1478, 1481, 1491, 1492, 1505, 1506, 1512, 1521, 1530, 1536, 1540, 1560, 1576, 1607, 1618, 1634, 1637, 1638, 1691, 1700, 1712, 1736, 1737, 1745, 1751, 1761, 1763, 1766, 1775, 1778, 1786, 1790, 1800, 1802, 1803, 1808, 1813, 1818, 1823, 1828, 1838, 3815]
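
To find movies that appear in the movie list but have no rating in the training data, the set difference can be taken against the rated MovieIDs instead. A sketch on made-up data (the `_demo` frames are hypothetical stand-ins for `movies` and `train`):

```python
import pandas as pd

# Hypothetical stand-ins for the movies and train dataframes.
movies_demo = pd.DataFrame({"MovieID": [1, 2, 3, 4, 5]})
train_demo = pd.DataFrame({
    "UserID": [10, 10, 11],
    "MovieID": [1, 3, 3],
    "Rating": [5, 4, 2],
})

catalogue_ids = set(movies_demo["MovieID"].tolist())
rated_ids = set(train_demo["MovieID"].tolist())

# Movies that are listed but never rated in the training data.
unrated_ids = sorted(catalogue_ids - rated_ids)
print(unrated_ids)
```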


<h3>[Insert additional data exploration here that seems needed for preprocessing]</h3>

In [None]:
#insert any further exploration code and notes here

<h2>Preprocessing</h2>

In this section, we now alter the data and/or its structure.

<h3>Handle Incomplete Data</h3>

In [None]:
#insert code to handle missing data here, be it drop or alter
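
A possible starting point is to count missing values per column and then either drop or fill them. The sketch below uses a made-up frame; whether the real files contain missing values at all is not established here, and the fill strategy (median, `"unknown"`) is only one assumption among many:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
users_demo = pd.DataFrame({
    "UserID": [1, 2, 3, 4],
    "Age": [25, np.nan, 35, 18],
    "Zip-code": ["48067", "70072", None, "02460"],
})

# Count missing values in each column.
missing_per_column = users_demo.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
dropped = users_demo.dropna()

# Option 2: fill numeric gaps with the median and text gaps with a marker.
filled = users_demo.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Zip-code"] = filled["Zip-code"].fillna("unknown")
print(filled)
```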

<h3>Handle Inconsistent Data</h3>

In [None]:
#insert code to handle data that is most likely incorrect, from result of Data Exploration

<h3>Handle Noise / Outliers</h3>

In [None]:
#insert handling of extreme or disruptive data here; can use binning or some other method.
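
One way to bin a numeric feature is `pd.cut`. The sketch below bins values like the `hour` column derived earlier; the bin edges and labels are hypothetical choices for illustration only:

```python
import pandas as pd

# Made-up hour-of-day values, like the 'hour' column derived earlier.
hours = pd.Series([0, 5, 9, 13, 18, 23], name="hour")

# Hypothetical bin edges and labels.
bins = [0, 6, 12, 18, 24]
labels = ["night", "morning", "afternoon", "evening"]

# right=False makes each bin include its left edge: [0, 6), [6, 12), ...
part_of_day = pd.cut(hours, bins=bins, labels=labels, right=False)
print(part_of_day.tolist())
```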

<h3>[Insert appropriate preprocessing step here]</h3>

In [None]:
#insert one-hot encoding here, Feature Selection and Extraction, PCA, binning, etc

<h2>Model Training</h2>

It's time to select and train our algorithm(s) to make models that can predict the test dataset.

In [None]:
#insert all code here; can use scikit-learn or a custom model
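
Before trying scikit-learn models, a simple baseline is to predict each movie's mean training rating and fall back to the global mean for movies with no ratings. The sketch below is an illustrative baseline on made-up data, not the model this notebook settles on:

```python
import pandas as pd

# Hypothetical miniature versions of the train and test dataframes.
train_demo = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3],
    "MovieID": [10, 20, 10, 30, 10],
    "Rating": [5, 3, 4, 2, 3],
})
test_demo = pd.DataFrame({"UserID": [3, 3], "MovieID": [20, 99]})

# "Training": the global mean and one mean rating per movie.
global_mean = train_demo["Rating"].mean()
movie_means = train_demo.groupby("MovieID")["Rating"].mean()

# "Prediction": look up the movie mean; unseen movies get the global mean.
predictions = test_demo["MovieID"].map(movie_means).fillna(global_mean)
print(predictions.tolist())
```

A baseline like this gives a reference score that any more elaborate model should beat.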

<h2>Evaluation and Analysis</h2>

From here onwards, we can only check how good our model is by cross-validating it on the training set. Predicting the test set is our final step; those predictions will be submitted to Kaggle for scoring.
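
A sketch of k-fold cross-validation written with NumPy only. It assumes RMSE as the score and a movie-mean baseline as the model; both are placeholders, since neither the competition metric nor the final model is fixed at this point. The data is made up:

```python
import numpy as np
import pandas as pd

# Made-up ratings standing in for the training set.
ratings = pd.DataFrame({
    "MovieID": [10, 10, 10, 20, 20, 20, 30, 30, 30, 30],
    "Rating": [5, 4, 4, 2, 3, 2, 3, 4, 3, 5],
})

def baseline_rmse(fit_df, val_df):
    """Fit a movie-mean baseline on fit_df and return its RMSE on val_df."""
    global_mean = fit_df["Rating"].mean()
    movie_means = fit_df.groupby("MovieID")["Rating"].mean()
    preds = val_df["MovieID"].map(movie_means).fillna(global_mean)
    return float(np.sqrt(((preds - val_df["Rating"]) ** 2).mean()))

# Shuffle the row positions once, then split them into 5 folds.
rng = np.random.default_rng(0)
shuffled = rng.permutation(len(ratings))
folds = np.array_split(shuffled, 5)

scores = []
for i, val_idx in enumerate(folds):
    # Every fold except the i-th one is used for fitting.
    fit_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    scores.append(baseline_rmse(ratings.iloc[fit_idx], ratings.iloc[val_idx]))

print(f"mean RMSE over {len(folds)} folds: {np.mean(scores):.3f}")
```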

<h3>[Insert model choosing and training here]</h3>

In [None]:
#insert prediction and results here.

Make notes in Markdown as to why a certain model is chosen, if necessary.

<h3>Save Model</h3>

In [None]:
#Saving the model is important, so we don't have to train a new model when sharing the .ipynb file with others.
#insert saving model method here.
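
One option from the standard library is `pickle`. The sketch below uses a plain dictionary as a stand-in for a trained model and a hypothetical file name in the temporary directory; for scikit-learn estimators, `joblib.dump` and `joblib.load` are a common alternative:

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable object is handled the same way.
model = {"global_mean": 3.4, "movie_means": {10: 4.0, 20: 3.0}}

# Hypothetical location; in the notebook this would be a path next to the .ipynb.
model_path = os.path.join(tempfile.gettempdir(), "model_demo.pkl")

# Save the model to disk.
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Load it back, as someone receiving the notebook would.
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == model)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.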

<h3>Final Run</h3>

In [None]:
#predict test set, save results in .csv with the header: ID,Predicted
#put code here
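
A sketch of writing the submission with the `ID,Predicted` header. The prediction values are made up, and numbering the IDs from 1 in test-row order is an assumption that should be checked against the competition's sample submission:

```python
import io

import pandas as pd

# Made-up predictions, one per row of the test set.
predictions = [3.0, 3.4, 4.1]

submission = pd.DataFrame({
    # Assumption: IDs are 1-based and follow the order of the test rows.
    "ID": range(1, len(predictions) + 1),
    "Predicted": predictions,
})

# Written to an in-memory buffer here so the sketch leaves no file behind.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

To produce the actual file, pass a path such as `"submission.csv"` (a hypothetical name) to `to_csv` instead of the buffer.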

Extra notes:

<h1>EOF</h1>