The MovieLens data sets were collected by the GroupLens Research Project, a research group in the Department of Computer Science and Engineering at the University of Minnesota.
The data is widely used for collaborative filtering and other filtering solutions.
Three files from the data folder need to be imported as data frames into your Jupyter notebook:
- u.data
- u.item
- u.user
The tasks are:
- Display univariate plots of the attributes 'rating', 'age', 'release date', 'gender' and 'occupation', each from its respective data frame.
- Visualize how the popularity of genres has changed over the years. From the graph, one should be able to see, for any given year, which genre had the most movies released.
- Display the top 25 movies by average rating, as a list/series/dataframe. Note: consider only the movies that received at least 100 ratings.
- Verify the following statements (no statistical test is needed; compare absolute numbers):
  - Men watch more Drama than women
  - Men watch more Romance than women
  - Women watch more Sci-Fi than men
After getting the data, the next step is to load it. This can be challenging because the files are not ordinary CSV files: u.item and u.user are pipe-delimited with no header row, and u.data is tab-separated.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# u.item: pipe-delimited, latin-1 encoded, no header row
item_columns = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL',
                'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime',
                'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
                'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
item = pd.read_csv('u.item', names=item_columns, sep='|', encoding='latin-1', header=None)
# u.data: tab-separated (the read_table default), no header row
data = pd.read_table('u.data', names=['user id', 'item id', 'rating', 'timestamp'])
# u.user: pipe-delimited, no header row
user = pd.read_csv('u.user', names=['user id', 'age', 'gender', 'occupation', 'zip code'], sep='|', encoding='latin-1', header=None)
After importing the data, I needed to clean it up so that it was usable for the analysis. I made the following changes:
- The data type of some columns needed to be corrected; for example, the columns that hold dates had to be converted.
- A new column was needed. You can either create one or rename an existing column that is otherwise useless; I went for option two.
- The movie titles look awkward with the year attached, so I removed the year from the end of every title.
- The tables needed to be merged. The item id in one table is the same as the movie id in the other, so I renamed item id to movie id to give both tables a common key.
- The univariate analysis was then done on the cleaned data.
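The cleaning code itself is not shown in this write-up, so the following is only a minimal sketch of what the steps above could look like, run on two made-up stand-in rows per table rather than the real files. It assumes that 'video release date' is the otherwise unused column that gets renamed to 'release year', that release dates look like 01-Jan-1995, that every title ends with the year in parentheses, and that the timestamp is in Unix seconds.

```python
import pandas as pd

# Made-up stand-in rows; the real tables come from u.item and u.data.
item = pd.DataFrame({
    "movie id": [1, 2],
    "movie title": ["Toy Story (1995)", "GoldenEye (1995)"],
    "release date": ["01-Jan-1995", "01-Jan-1995"],
    "video release date": [None, None],
})
data = pd.DataFrame({
    "user id": [196, 186],
    "item id": [1, 2],
    "rating": [3, 4],
    "timestamp": [881250949, 891717742],
})

# 1. Fix dtypes: parse the release date and the Unix-seconds timestamp.
item["release date"] = pd.to_datetime(item["release date"], format="%d-%b-%Y")
data["timestamp"] = pd.to_datetime(data["timestamp"], unit="s")

# 2. Reuse an unneeded column for the year (assumption: 'video release date').
item = item.rename(columns={"video release date": "release year"})
item["release year"] = item["release date"].dt.year

# 3. Strip the trailing "(YYYY)" from each title.
item["movie title"] = item["movie title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

# 4. Give both tables the same key name so they can be merged.
data = data.rename(columns={"item id": "movie id"})

print(item)
print(data)
```

The regular expression only removes a four-digit year at the very end of the title, so parentheses elsewhere in a title are left alone.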
After cleaning, the item table looks like this:
I then needed to merge the tables together:
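# Attach user attributes to every rating, then movie attributes
userdata = pd.merge(data, user, on='user id', how='right')
userdata = pd.merge(userdata, item, on='movie id', how='right')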
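A merge like this is easy to get subtly wrong, so here is a small sketch of how it could be checked, using three made-up ratings. Note that it swaps in how='left' anchored on the ratings table instead of the how='right' used above; the two should give the same rows only if every user and every movie has at least one rating, which is an assumption here. The validate argument asks pandas to raise an error if a lookup table contains duplicate keys.

```python
import pandas as pd

# Made-up stand-ins for the cleaned tables.
data = pd.DataFrame({"user id": [1, 1, 2], "movie id": [10, 20, 10], "rating": [5, 3, 4]})
user = pd.DataFrame({"user id": [1, 2], "gender": ["M", "F"]})
item = pd.DataFrame({"movie id": [10, 20], "movie title": ["A", "B"]})

# Many ratings map to one user and to one movie.
merged = data.merge(user, on="user id", how="left", validate="many_to_one")
merged = merged.merge(item, on="movie id", how="left", validate="many_to_one")

# One row per rating, so the row count should match the ratings table.
assert len(merged) == len(data)
print(merged)
```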
The merged userdata frame then has the following columns:
# Count how often each rating value occurs, ordered by rating
rating = userdata['rating'].value_counts().sort_index()
plt.figure(figsize=(10, 5))
sns.barplot(x=rating.index, y=rating.values)
plt.xlabel('Ratings', fontsize=12)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Ratings')
plt.show()
# Count users (not ratings) per age, so use the user table rather than the merged one
age = user['age'].value_counts().sort_index()
plt.figure(figsize=(12, 6))
sns.barplot(x=age.index, y=age.values)
plt.xlabel('Age', fontsize=12)
plt.xticks(rotation=90)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Age')
plt.show()
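The chart above draws one bar per distinct age, which can be hard to read. As an optional alternative, ages can be grouped into bands with pd.cut before counting. This is only a sketch on made-up ages, and the band edges and labels are my own assumption, not something defined by the data set.

```python
import pandas as pd

# Made-up ages; in the notebook this would be the 'age' column of the user table.
user = pd.DataFrame({"age": [7, 19, 24, 24, 33, 48, 61]})

# Hypothetical band edges; each band includes its right edge, e.g. (17, 24].
bins = [0, 17, 24, 34, 44, 54, 120]
labels = ["<18", "18-24", "25-34", "35-44", "45-54", "55+"]
user["age group"] = pd.cut(user["age"], bins=bins, labels=labels)

# Count users per band and keep the bands in their natural order.
age_groups = user["age group"].value_counts().sort_index()
print(age_groups)
```

The resulting series can be passed to the same barplot call as before, with far fewer bars.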
# The 20 release years with the most movies
release_date = item['release year'].value_counts().sort_values(ascending=False).head(20)
plt.figure(figsize=(12, 6))
sns.barplot(x=release_date.index, y=release_date.values)
plt.xlabel('Years', fontsize=12)
plt.xticks(rotation=90)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Movies Released per Year')
plt.show()
# Count users (not ratings) per gender, so use the user table rather than the merged one
gender = user['gender'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(8, 4))
sns.barplot(x=gender.index, y=gender.values)
plt.xlabel('Gender', fontsize=12)
plt.xticks(rotation=90)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Users by Gender')
plt.show()
# Count users (not ratings) per occupation, so use the user table rather than the merged one
occupation = user['occupation'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=occupation.index, y=occupation.values)
plt.xlabel('Occupations', fontsize=12)
plt.xticks(rotation=90)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Users by Occupation')
plt.show()
Visualize how popularity of Genres has changed over the years. From the graph one should be able to see for any given year, movies of which genre got released the most.
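# Count movies (not ratings) per genre and release year, so use the item table.
# .head(30) keeps the 30 earliest release years; drop it to plot every year.
genre_counts = item.loc[:, 'Action':'Western'].groupby(item['release year']).sum().head(30)
genre_counts
genre_counts.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.xlabel('Release Year')
plt.ylabel('Number of Movies')
plt.title('Popularity of Genres Across Years')
plt.show()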
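A stacked bar chart can make it hard to tell which genre leads in a given year. One way to read that off directly is to take the column with the largest value in each row of a table shaped like genre_counts. The sketch below uses made-up counts for three genres and three years.

```python
import pandas as pd

# Made-up counts shaped like genre_counts: one row per year, one column per genre.
genre_counts = pd.DataFrame(
    {"Action": [2, 5, 1], "Comedy": [4, 3, 1], "Drama": [1, 9, 0]},
    index=pd.Index([1994, 1995, 1996], name="release year"),
)

# idxmax(axis=1) gives the column label of the row maximum;
# on a tie it returns the first such column.
summary = pd.DataFrame({
    "top genre": genre_counts.idxmax(axis=1),
    "movies": genre_counts.max(axis=1),
})
print(summary)
```

Because ties resolve to the first column, a year where two genres are level will only show one of them, so it is worth checking the counts for close years.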
Display the top 25 movies by average rating, as a list/series/dataframe. Note: consider only the movies that received at least 100 ratings.
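# Number of ratings and mean rating per movie (the title is kept so the result is readable)
ratings = userdata.groupby(['movie id', 'movie title'])['rating'].agg(['count', 'mean'])
# Keep only movies with at least 100 ratings
ratingabove100 = ratings[ratings['count'] >= 100]
top_movies = ratingabove100.sort_values('mean', ascending=False).head(25)
top_movies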
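Verify the following statements by comparing absolute numbers (no statistical test): men watch more Drama than women, men watch more Romance than women, women watch more Sci-Fi than men.
# Number of ratings per gender, split by whether the movie is a Drama
drama = userdata.groupby(['gender', 'Drama'])['rating'].count()
drama
# In the genre columns, 0 means the movie is not in the genre and 1 means it is
f_drama = drama.loc['F', 1]
m_drama = drama.loc['M', 1]
print(f'Drama ratings by women: {f_drama}, by men: {m_drama}. Men watch more Drama than women: {m_drama > f_drama}')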
# Number of ratings per gender, split by whether the movie is a Romance
romance = userdata.groupby(['gender', 'Romance'])['rating'].count()
romance
f_romance = romance.loc['F', 1]
m_romance = romance.loc['M', 1]
print(f'Romance ratings by women: {f_romance}, by men: {m_romance}. Men watch more Romance than women: {m_romance > f_romance}')
# Number of ratings per gender, split by whether the movie is Sci-Fi
sci_fi = userdata.groupby(['gender', 'Sci-Fi'])['rating'].count()
sci_fi
f_sci_fi = sci_fi.loc['F', 1]
m_sci_fi = sci_fi.loc['M', 1]
print(f'Sci-Fi ratings by women: {f_sci_fi}, by men: {m_sci_fi}. Women watch more Sci-Fi than men: {f_sci_fi > m_sci_fi}')
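The three checks above repeat the same pattern, so they could also be done in one step. Since each genre column is a 0/1 flag on every rating row, summing the flags per gender gives the number of ratings in each genre. The sketch below shows this on five made-up rating rows; the counts are illustrative only and say nothing about the real result.

```python
import pandas as pd

# Made-up rows shaped like userdata: one row per rating, 0/1 genre flags.
userdata = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "M"],
    "Drama": [1, 0, 1, 1, 1],
    "Romance": [0, 1, 1, 0, 0],
    "Sci-Fi": [1, 1, 0, 0, 0],
})
genres = ["Drama", "Romance", "Sci-Fi"]

# Rows are genders, columns are genres, values are rating counts.
by_gender = userdata.groupby("gender")[genres].sum()

# True where men have strictly more ratings than women in that genre.
men_more = by_gender.loc["M"] > by_gender.loc["F"]
print(by_gender)
print(men_more)
```

Keep in mind that these are absolute counts, as the task asks for, so a gender with more ratings overall will tend to lead in most genres.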
genre_columns = ['Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
# For each row, join the names of all genres flagged 1 into one '|'-separated string
userdata['genre'] = userdata[genre_columns].apply(lambda x: '|'.join(x.index[x == 1]), axis=1)
userdata





