MovieLens

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota.

Context:

The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. The data is widely used for collaborative filtering and other filtering solutions.

Three files from the data folder need to be imported as data frames into a Jupyter notebook:

u.data
u.item
u.user

Task

Display univariate plots of the attributes 'rating', 'age', 'release date', 'gender' and 'occupation' from their respective data frames.
Visualize how the popularity of genres has changed over the years. From the graph, one should be able to see which genre had the most movies released in any given year.
Display the top 25 movies by average rating, as a list/series/dataframe. Note: consider only the movies that received at least 100 ratings.
Verify the following statements (no statistical test is needed; compare absolute numbers):
Men watch more Drama than women
Men watch more Romance than women
Women watch more Sci-Fi than men

References:

 https://movielens.org/

Data importation

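After downloading the data, the next step is to load it. This can be challenging because the files are not ordinary comma-separated CSV files: u.item and u.user are pipe-delimited, u.data is tab-delimited, and none of them has a header row, so the column names have to be supplied by hand.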

import pandas as pd

# u.item: pipe-delimited, Latin-1 encoded, no header row
item = pd.read_csv('u.item', sep='|', encoding='latin-1', header=None,
                   names=['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL',
                          'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime',
                          'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
                          'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'])

# u.data: tab-delimited (the read_table default), no header row
data = pd.read_table('u.data', names=['user id', 'item id', 'rating', 'timestamp'])

# u.user: pipe-delimited, no header row
user = pd.read_csv('u.user', sep='|', encoding='latin-1', header=None,
                   names=['user id', 'age', 'gender', 'occupation', 'zip code'])
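To make the file layout concrete, here is a minimal sketch that parses a few in-memory rows written in the same pipe-delimited and tab-delimited layout. The sample rows and the names `user_sample`, `data_sample`, `user_demo` and `data_demo` are illustrative assumptions, not part of the notebook.

```python
import io
import pandas as pd

# Illustrative rows in the same layout as u.user (pipe-delimited) and u.data (tab-delimited)
user_sample = "1|24|M|technician|85711\n2|53|F|other|94043\n"
data_sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

# header=None because the files carry no header row; names supplies the columns
user_demo = pd.read_csv(io.StringIO(user_sample), sep="|", header=None,
                        names=["user id", "age", "gender", "occupation", "zip code"])
data_demo = pd.read_csv(io.StringIO(data_sample), sep="\t", header=None,
                        names=["user id", "item id", "rating", "timestamp"])

print(user_demo)
print(data_demo)
```

Reading a handful of rows this way is a quick check that the separator and the column names line up before loading the full files.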

Data Cleaning

After importing the data, I needed to clean it up so that it was usable for our analysis. I made the following changes and created the following variables:

The data type of each column needed to be corrected; for example, the date columns had to be converted to a date type.
A new column was needed for the release year. It can either be created from scratch or take over an existing column that is otherwise useless; I went with the second option.
The movie titles looked awkward, so I removed the year that follows every title.
The tables needed to be merged. The item id in the data table is the same as the movie id in the item table, so I renamed item id to movie id to give both tables a common key.
The univariate analysis was done on the cleaned data. The cleaning produced three tables: the item table, the users table and the data table.
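The cleaning steps listed above can be sketched as follows. This is a sketch under stated assumptions, not the notebook's actual code: the toy rows are made up, and the choice of 'video release date' as the column that is repurposed into 'release year' is a guess, since the write-up does not say which column was reused.

```python
import pandas as pd

# Toy stand-ins for the item and data tables (made-up rows, same column names as the import step)
item = pd.DataFrame({
    "movie id": [1, 2],
    "movie title": ["Toy Story (1995)", "GoldenEye (1995)"],
    "release date": ["01-Jan-1995", "01-Jan-1995"],
    "video release date": [None, None],
})
data = pd.DataFrame({"user id": [1, 2], "item id": [1, 2], "rating": [5, 3],
                     "timestamp": [874965758, 876893171]})

# 1. Fix data types: parse date strings, and convert Unix timestamps (seconds) to datetimes
item["release date"] = pd.to_datetime(item["release date"], format="%d-%b-%Y", errors="coerce")
data["timestamp"] = pd.to_datetime(data["timestamp"], unit="s")

# 2. Repurpose an unused column as 'release year' (assumption: 'video release date')
item = item.rename(columns={"video release date": "release year"})
item["release year"] = item["release date"].dt.year

# 3. Strip the trailing "(YYYY)" from each title
item["movie title"] = item["movie title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

# 4. Give both tables the same key name so they can be merged
data = data.rename(columns={"item id": "movie id"})

print(item)
print(data.dtypes)
```

Using `errors="coerce"` turns unparseable or missing dates into `NaT` instead of raising, at the cost of silently hiding bad rows, so it is worth counting the `NaT` values afterwards.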

I then merged the tables together.

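# Ratings and users share the 'user id' column
userdata = pd.merge(data, user, on='user id', how='right')
# After the rename, ratings and items share the 'movie id' column
userdata = pd.merge(userdata, item, on='movie id', how='right')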
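As a hedged aside, pandas can check that a merge has the expected shape. The sketch below uses made-up toy tables and the `validate` argument of `pd.merge` to assert that each rating matches at most one user and one movie; it is an optional safeguard, not something the notebook does.

```python
import pandas as pd

# Made-up toy tables with the same key columns as the real ones
data = pd.DataFrame({"user id": [1, 1, 2], "movie id": [10, 11, 10], "rating": [4, 5, 3]})
user = pd.DataFrame({"user id": [1, 2], "gender": ["M", "F"]})
item = pd.DataFrame({"movie id": [10, 11], "movie title": ["A", "B"]})

# validate="many_to_one" raises MergeError if a key is duplicated on the right side
userdata = pd.merge(data, user, on="user id", how="right", validate="many_to_one")
userdata = pd.merge(userdata, item, on="movie id", how="right", validate="many_to_one")

print(userdata)
```

With `how="right"`, users or movies without any rating still appear in the result, with missing values in the rating columns.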

The merged userdata frame then contains the columns of all three tables.

Univariate plots of the attributes

Ratings

import matplotlib.pyplot as plt
import seaborn as sns

# Count how often each rating value was given
rating = userdata['rating'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(10, 5))
sns.barplot(x=rating.index, y=rating.values)
plt.xlabel('Ratings', fontsize=12)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Ratings')
plt.show()

Age

# Count users per age from the user table, so each user is counted once
# (the merged table would count one row per rating instead)
age = user['age'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=age.index, y=age.values)
plt.xlabel('Age', fontsize=12)
plt.xticks(rotation=90)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Age')
plt.show()
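A bar per individual age can be hard to read. One alternative, sketched here with made-up ages and arbitrarily chosen bin edges (both assumptions), is to group ages into ranges with `pd.cut` before counting.

```python
import pandas as pd

# Made-up ages standing in for user['age']
ages = pd.Series([7, 19, 24, 24, 35, 53, 61], name="age")

# Bin edges and labels are arbitrary choices for illustration; intervals are right-inclusive
bins = [0, 18, 25, 35, 45, 55, 100]
labels = ["<=18", "19-25", "26-35", "36-45", "46-55", "56+"]
age_groups = pd.cut(ages, bins=bins, labels=labels)

# sort_index keeps the groups in age order rather than in order of frequency
group_counts = age_groups.value_counts().sort_index()
print(group_counts)
```

The resulting `group_counts` series can be passed to `sns.barplot` in the same way as the per-age counts.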

Release Date

# The 20 years with the most movie releases
release_date = item['release year'].value_counts().sort_values(ascending=False).head(20)
plt.figure(figsize=(12, 6))
sns.barplot(x=release_date.index, y=release_date.values)
plt.xlabel('Years', fontsize=12)
plt.xticks(rotation=90)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Movies Released per Year')
plt.show()

Gender

# Count users per gender from the user table, so each user is counted once
gender = user['gender'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(8, 4))
sns.barplot(x=gender.index, y=gender.values)
plt.xlabel('Gender', fontsize=12)
plt.xticks(rotation=90)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Users by Gender')
plt.show()

Occupation

# Count users per occupation from the user table, so each user is counted once
occupation = user['occupation'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(12, 6))
sns.barplot(x=occupation.index, y=occupation.values)
plt.xlabel('Occupations', fontsize=12)
plt.xticks(rotation=90)
plt.ylabel('Counts', fontsize=12)
plt.title('Univariate Analysis of Users by Occupation')
plt.show()

Visualize how the popularity of genres has changed over the years. From the graph, one should be able to see which genre had the most movies released in any given year.

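# Count movies per genre and release year from the item table, where each movie appears once
# (the merged table has one row per rating, so it would count ratings instead of movies)
genre_counts = item.groupby('release year').sum(numeric_only=True).loc[:, 'Action':'Western'].head(30)
genre_counts
genre_counts.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.xlabel('Release Year')
plt.ylabel('Number of Movies')
plt.title('Popularity of Genres Across Years')
plt.show()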
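A stacked bar chart makes it hard to read off the single most-released genre per year. As a complement, the leading genre can be computed directly from a year-by-genre count table with `idxmax`. The sketch below uses a made-up table with three genres; the variable names `top_genre` and `share` are illustrative.

```python
import pandas as pd

# Made-up year-by-genre counts in the same shape as genre_counts
genre_counts = pd.DataFrame(
    {"Action": [3, 1], "Comedy": [2, 5], "Drama": [1, 4]},
    index=pd.Index([1994, 1995], name="release year"),
)

# Genre with the most releases in each year
top_genre = genre_counts.idxmax(axis=1)

# Share of each genre within a year, useful when yearly totals differ widely
share = genre_counts.div(genre_counts.sum(axis=1), axis=0)

print(top_genre)
print(share.round(2))
```

Note that `idxmax` returns the first genre in column order when two genres tie, and that a movie tagged with several genres is counted once per genre.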

Display the top 25 movies by average rating, as a list/series/dataframe. Note: consider only the movies that received at least 100 ratings.

ratings = userdata.groupby('movie id')['rating'].agg(['count', 'mean'])
ratingabove100 = ratings[ratings['count'] >= 100]
top_movies = ratingabove100.sort_values('mean', ascending=False).head(25)
top_movies
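Grouping by movie id alone leaves the result without titles. A possible variant, sketched here on made-up data with a hypothetical helper `top_movies_by_mean`, groups by both id and title and exposes the rating threshold as a parameter.

```python
import pandas as pd

def top_movies_by_mean(ratings_df, min_ratings=100, n=25):
    """Return the n best-rated movies among those with at least min_ratings ratings."""
    stats = ratings_df.groupby(["movie id", "movie title"])["rating"].agg(["count", "mean"])
    eligible = stats[stats["count"] >= min_ratings]
    # Break ties in the mean by the number of ratings
    return eligible.sort_values(["mean", "count"], ascending=False).head(n)

# Made-up ratings: movie C has a perfect mean but too few ratings to qualify
toy = pd.DataFrame({
    "movie id": [1, 1, 1, 2, 2, 3],
    "movie title": ["A", "A", "A", "B", "B", "C"],
    "rating": [5, 4, 3, 5, 5, 5],
})
result = top_movies_by_mean(toy, min_ratings=2, n=2)
print(result)
```

On the merged table the call would look like `top_movies_by_mean(userdata)`, assuming it still carries the 'movie id', 'movie title' and 'rating' columns.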

Verify the following statements (no statistical test is needed; compare absolute numbers):

Men watch more Drama than women

# Count ratings per gender, split by whether the movie is a Drama
drama = userdata.groupby(['gender', 'Drama'])['rating'].count()
drama
# In the genre columns, 0 means the movie is not in the genre and 1 means it is
f_drama = drama.loc['F', 1]
m_drama = drama.loc['M', 1]
print(f'Drama ratings by women: {f_drama}, by men: {m_drama}. In absolute numbers, men watch more Drama than women.')

Men watch more Romance than women

# Count ratings per gender, split by whether the movie is a Romance
romance = userdata.groupby(['gender', 'Romance'])['rating'].count()
romance
f_romance = romance.loc['F', 1]
m_romance = romance.loc['M', 1]
print(f'Romance ratings by women: {f_romance}, by men: {m_romance}. In absolute numbers, men watch more Romance than women.')

Women watch more Sci-Fi than men

# Count ratings per gender, split by whether the movie is Sci-Fi
sci_fi = userdata.groupby(['gender', 'Sci-Fi'])['rating'].count()
sci_fi
f_sci_fi = sci_fi.loc['F', 1]
m_sci_fi = sci_fi.loc['M', 1]
print(f'Sci-Fi ratings by women: {f_sci_fi}, by men: {m_sci_fi}. In absolute numbers, men watch more Sci-Fi than women, so the statement does not hold.')
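The three checks share the same pattern, so they can also be done in one step by summing the 0/1 genre flags per gender. The sketch below uses made-up rows. The per-gender share is an addition that goes beyond the task, included because absolute counts depend on how many ratings each gender contributed.

```python
import pandas as pd

# Made-up rating rows: one row per rating, with 0/1 genre flags as in userdata
toy = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F"],
    "Drama": [1, 0, 1, 1, 0],
    "Romance": [0, 1, 0, 1, 1],
    "Sci-Fi": [1, 1, 0, 0, 0],
})
genres = ["Drama", "Romance", "Sci-Fi"]

# Absolute number of ratings per gender and genre
counts = toy.groupby("gender")[genres].sum()

# Fraction of each gender's ratings that fall in each genre
shares = counts.div(toy.groupby("gender").size(), axis=0)

print(counts)
print(shares.round(2))
```

Comparing `counts.loc['M']` with `counts.loc['F']` answers the three statements in absolute numbers, while `shares` shows whether a difference survives once the number of ratings per gender is taken into account.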

Lastly, I wanted to collate all the genres of each film into a single column.

genre_columns = ['Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
userdata['genre'] =userdata[genre_columns].apply(lambda x: '|'.join(x.index[x == 1]), axis=1)
userdata
