# Mini Project on Descriptive Analytics Using File Handling

1. **Descriptive Analysis**

`Analyze the distribution of movie ratings. What percentage of movies have high (5), medium (3-4), and low (1-2) ratings?`

`Identify the top 10 most-rated movies.`


2. **Genre Insights**

`Which movie genres are the most frequently rated?`

`Compare the average ratings across different genres. Are certain genres consistently rated higher or lower?`

3. **User Engagement Analysis**

`Identify the most active users (by profession) based on the number of ratings they’ve given.`

`Analyze the relationship between user demographic attributes (age, gender, occupation) and their movie preferences or rating patterns.`


4. **Rating Distribution by Demographics**

`Investigate how ratings vary by user demographic attributes (age, gender, occupation).`

`Are there specific genres preferred by certain age groups or occupations?`


5. **Top Performers**

`Identify the movies with the highest average ratings (considering a minimum number of ratings for fairness).`

`Analyze the characteristics of top-rated movies (e.g., release year, genres).`


6. **Exploring Long Tail**

`Investigate the "long tail" of the dataset: How many movies receive very few ratings?`

`What are the characteristics of these less-rated movies compared to popular ones?`


7. **Tag Analysis**

`Analyze the tags associated with movies. What are the most frequently used tags?`

`Are tags consistent with movie genres?`

## Optional tasks for self-learning

8. **Visualization Projects**

`Create dashboards to visualize:`

`The distribution of ratings by genres and years.`

`Popular genres by user demographics.`

`Heatmaps showing the correlation between genres, user activity, and ratings.`

In [1]:
# First Solution - Descriptive Analysis

ratings = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\ratings.dat")
rating_distribution = dict()
movieId_count = dict()

for line in ratings:
    line = line.strip()
    columns = list(map(int, line.split('::')))
    if columns[2] == 5:
        columns.append('High')
    elif columns[2] == 4 or columns[2] == 3:
        columns.append('Medium')
    else:
        columns.append('Low')

    if columns[1] in movieId_count:
        movieId_count[columns[1]] += 1
    else: 
        movieId_count[columns[1]] = 1

    if columns[4] in rating_distribution:
        rating_distribution[columns[4]] += 1
    else:
        rating_distribution[columns[4]] = 1

for rating_range in rating_distribution:
    print('{0} {1}'.format(rating_range, int(rating_distribution[rating_range] / sum(rating_distribution.values()) * 100)))

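# movies.dat contains accented titles; ISO-8859-1 (as used in the seventh solution) avoids decode errors on non-Windows systems
movies = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\movies.dat", encoding="ISO-8859-1")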
movieId_name = dict()
for line in movies:
    line = line.strip()
    columns = line.split('::')
    movieId_name[int(columns[0])] = columns[1]
    
sorted_counted_data = sorted(movieId_count.items(), key = lambda x:x[1], reverse=True)[:10]
for movieId, count in sorted_counted_data:
    print(movieId_name[movieId], count)

High 22
Medium 61
Low 16
American Beauty (1999) 3428
Star Wars: Episode IV - A New Hope (1977) 2991
Star Wars: Episode V - The Empire Strikes Back (1980) 2990
Star Wars: Episode VI - Return of the Jedi (1983) 2883
Jurassic Park (1993) 2672
Saving Private Ryan (1998) 2653
Terminator 2: Judgment Day (1991) 2649
Matrix, The (1999) 2590
Back to the Future (1985) 2583
Silence of the Lambs, The (1991) 2578
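
The cell above truncates each percentage with `int()`, which is why the three printed values add up to 99 rather than 100. A minimal sketch of the same bucketing with `collections.Counter` and rounded percentages follows. The helper names `rating_bucket` and `bucket_percentages` and the four sample lines are made up for illustration; only the `UserID::MovieID::Rating::Timestamp` layout comes from the code above.

```python
from collections import Counter
import io

def rating_bucket(rating):
    """Map a 1-5 star rating to the High / Medium / Low bands used above."""
    if rating == 5:
        return "High"
    if rating in (3, 4):
        return "Medium"
    return "Low"

def bucket_percentages(lines):
    """Return {band: percentage}, rounded to one decimal place."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, rating, timestamp = line.split("::")
        counts[rating_bucket(int(rating))] += 1
    total = sum(counts.values())
    return {band: round(100 * n / total, 1) for band, n in counts.items()}

# Made-up sample in the ratings.dat layout (UserID::MovieID::Rating::Timestamp).
sample = io.StringIO(
    "1::10::5::978300760\n"
    "1::11::3::978302109\n"
    "2::10::4::978301968\n"
    "2::12::1::978300275\n"
)
percentages = bucket_percentages(sample)
print(percentages)
```

Because `bucket_percentages` accepts any iterable of lines, an open file handle for `ratings.dat` could be passed in place of the sample.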


In [7]:
# Second Solution - Genre Insights
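# ISO-8859-1 matches the encoding used in the seventh solution and handles accented titles
movies = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\movies.dat", encoding="ISO-8859-1")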
ratings = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\ratings.dat")

movieId_genres = dict()
genre_rating_count = dict()
genre_rating_sum = dict()

for line in movies:
    line = line.strip()
    columns = line.split('::')
    movie_id = int(columns[0])
    genres = columns[2].split('|')
    movieId_genres[movie_id] = genres

for line in ratings:
    line = line.strip()
    columns = list(map(int, line.split('::')))
    movie_id = columns[1]
    rating = columns[2]

    if movie_id in movieId_genres:
        genres = movieId_genres[movie_id]
        for genre in genres:
            if genre in genre_rating_count:
                genre_rating_count[genre] += 1
                genre_rating_sum[genre] += rating
            else:
                genre_rating_count[genre] = 1
                genre_rating_sum[genre] = rating

genre_avg_rating = {genre: genre_rating_sum[genre] / genre_rating_count[genre] for genre in genre_rating_count}
sorted_genres = sorted(genre_rating_count.items(), key=lambda x: x[1], reverse=True)

print("Most Frequently Rated Genres:")
for genre, count in sorted_genres:
    print(f"{genre}: {count} ratings")

print("\nThe average Ratings by Genre:")
for genre, avg_rating in sorted(genre_avg_rating.items(), key=lambda x: x[1], reverse=True):
    print(f"{genre}: {avg_rating:.2f}")


Most Frequently Rated Genres:
Comedy: 356580 ratings
Drama: 354529 ratings
Action: 257457 ratings
Thriller: 189680 ratings
Sci-Fi: 157294 ratings
Romance: 147523 ratings
Adventure: 133953 ratings
Crime: 79541 ratings
Horror: 76386 ratings
Children's: 72186 ratings
War: 68527 ratings
Animation: 43293 ratings
Musical: 41533 ratings
Mystery: 40178 ratings
Fantasy: 36301 ratings
Western: 20683 ratings
Film-Noir: 18261 ratings
Documentary: 7910 ratings

The average Ratings by Genre:
Film-Noir: 4.08
Documentary: 3.93
War: 3.89
Drama: 3.77
Crime: 3.71
Animation: 3.68
Mystery: 3.67
Musical: 3.67
Western: 3.64
Romance: 3.61
Thriller: 3.57
Comedy: 3.52
Action: 3.49
Adventure: 3.48
Sci-Fi: 3.47
Fantasy: 3.45
Children's: 3.42
Horror: 3.22
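
The cells in this notebook open the data files with `open()` and never close them. One possible way to tidy this up is a small generator that wraps the file in a `with` block, sketched below together with the same per-genre averaging. `read_dat` and `genre_averages` are hypothetical helper names, and the two tiny files written to a temporary directory are made-up stand-ins for `movies.dat` and `ratings.dat`.

```python
import os
import tempfile
from collections import defaultdict

def read_dat(path, encoding="ISO-8859-1"):
    """Yield the '::'-separated fields of each non-empty line; the file is closed when done."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line.split("::")

def genre_averages(movies_path, ratings_path):
    """Return {genre: (rating_count, average_rating)}."""
    movie_genres = {int(mid): genres.split("|") for mid, _title, genres in read_dat(movies_path)}
    totals = defaultdict(lambda: [0, 0])  # genre -> [sum of ratings, number of ratings]
    for _user, movie_id, rating, _ts in read_dat(ratings_path):
        # A movie with several genres contributes its rating to each of them.
        for genre in movie_genres.get(int(movie_id), []):
            totals[genre][0] += int(rating)
            totals[genre][1] += 1
    return {genre: (n, s / n) for genre, (s, n) in totals.items()}

# Tiny made-up files so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    movies_path = os.path.join(tmp, "movies.dat")
    ratings_path = os.path.join(tmp, "ratings.dat")
    with open(movies_path, "w", encoding="ISO-8859-1") as f:
        f.write("1::Movie A (1990)::Comedy|Drama\n2::Movie B (1995)::Drama\n")
    with open(ratings_path, "w", encoding="ISO-8859-1") as f:
        f.write("1::1::4::0\n1::2::2::0\n2::1::5::0\n")
    stats = genre_averages(movies_path, ratings_path)

for genre, (count, avg) in sorted(stats.items()):
    print(f"{genre}: {count} ratings, average {avg:.2f}")
```

Since a multi-genre movie is counted once per genre, the per-genre counts add up to more than the total number of ratings.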


In [11]:
# Third solution - User Engagement Analysis
users = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\users.dat")
ratings = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\ratings.dat")

user_demographics = {}
occupation_rating_count = {}
user_rating_count = {}

for line in users:
    line = line.strip()
    user_id, gender, age, occupation, zip_code = line.split("::")
    user_demographics[int(user_id)] = {"gender": gender, "age": int(age), "occupation": int(occupation)}

for line in ratings:
    line = line.strip()
    user_id, movie_id, rating, timestamp = map(int, line.split("::"))
    
    if user_id in user_rating_count:
        user_rating_count[user_id] += 1
    else:
        user_rating_count[user_id] = 1
    
    if user_id in user_demographics:
        occupation = user_demographics[user_id]["occupation"]
        if occupation in occupation_rating_count:
            occupation_rating_count[occupation] += 1
        else:
            occupation_rating_count[occupation] = 1

most_active_occupations = sorted(occupation_rating_count.items(), key=lambda x: x[1], reverse=True)

print("Most Active Occupations (Ratings Given):")
for occupation, count in most_active_occupations:
    print(f"Occupation {occupation}")

gender_rating_count = {"M": 0, "F": 0}
age_rating_count = {}

for user_id, count in user_rating_count.items():
    if user_id in user_demographics:
        gender = user_demographics[user_id]["gender"]
        age = user_demographics[user_id]["age"]
        gender_rating_count[gender] += count
        
        if age in age_rating_count:
            age_rating_count[age] += count
        else:
            age_rating_count[age] = count

print("\nRatings by Gender:")
for gender, count in gender_rating_count.items():
    print(f"{gender}")

print("\nRatings by Age Group:")
for age, count in sorted(age_rating_count.items()):
    print(f"Age {age}")


Most Active Occupations (Ratings Given):
Occupation 4
Occupation 0
Occupation 7
Occupation 1
Occupation 17
Occupation 20
Occupation 12
Occupation 2
Occupation 14
Occupation 16
Occupation 6
Occupation 3
Occupation 10
Occupation 15
Occupation 5
Occupation 11
Occupation 19
Occupation 13
Occupation 18
Occupation 9
Occupation 8

Ratings by Gender:
M
F

Ratings by Age Group:
Age 1
Age 18
Age 25
Age 35
Age 45
Age 50
Age 56
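
The output above lists occupation codes only, without names or counts. As a sketch of how the ranking could be made readable, the snippet below maps each code to the label given in the MovieLens 1M README and prints the number of ratings next to it. The function name `ratings_per_occupation` and the sample user and rating lines are made up for illustration.

```python
from collections import Counter

# Occupation codes as listed in the MovieLens 1M README.
OCCUPATIONS = {
    0: "other or not specified", 1: "academic/educator", 2: "artist",
    3: "clerical/admin", 4: "college/grad student", 5: "customer service",
    6: "doctor/health care", 7: "executive/managerial", 8: "farmer",
    9: "homemaker", 10: "K-12 student", 11: "lawyer", 12: "programmer",
    13: "retired", 14: "sales/marketing", 15: "scientist", 16: "self-employed",
    17: "technician/engineer", 18: "tradesman/craftsman", 19: "unemployed",
    20: "writer",
}

def ratings_per_occupation(user_lines, rating_lines):
    """Count ratings per occupation code from users.dat / ratings.dat style lines."""
    user_occupation = {}
    for line in user_lines:
        user_id, gender, age, occupation, zip_code = line.strip().split("::")
        user_occupation[int(user_id)] = int(occupation)
    counts = Counter()
    for line in rating_lines:
        user_id = int(line.split("::", 1)[0])  # only the first field is needed here
        if user_id in user_occupation:
            counts[user_occupation[user_id]] += 1
    return counts

# Made-up sample lines in the users.dat and ratings.dat layouts.
users_sample = ["1::F::1::10::48067", "2::M::56::16::70072", "3::M::25::15::55117"]
ratings_sample = [
    "1::10::5::0", "1::11::3::0", "2::10::4::0",
    "3::12::4::0", "3::10::2::0", "3::11::5::0",
]
counts = ratings_per_occupation(users_sample, ratings_sample)
for code, n in counts.most_common():
    print(f"{OCCUPATIONS[code]} (code {code}): {n} ratings")
```

Note that this ranks occupations by total ratings, so a large group of occasional raters can outrank a small group of very active ones; dividing by the number of users per occupation would give ratings per user instead.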


In [1]:
# Fourth solution - Rating Distribution by Demographics
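# ISO-8859-1 matches the encoding used in the seventh solution and handles accented titles
movies = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\movies.dat", encoding="ISO-8859-1")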
users = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\users.dat")
ratings = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\ratings.dat")

movie_genres = {}
user_demographics = {}
demographic_ratings = {"age": {}, "gender": {}, "occupation": {}}
genre_preferences = {"age": {}, "gender": {}, "occupation": {}}

for line in movies:
    line = line.strip()
    movie_id, title, genres = line.split("::")
    movie_genres[int(movie_id)] = genres.split("|")

for line in users:
    line = line.strip()
    user_id, gender, age, occupation, zip_code = line.split("::")
    user_demographics[int(user_id)] = {"gender": gender, "age": int(age), "occupation": int(occupation)}

for line in ratings:
    line = line.strip()
    user_id, movie_id, rating, timestamp = map(int, line.split("::"))
    if user_id not in user_demographics or movie_id not in movie_genres:
        continue

    demo = user_demographics[user_id]
    gender = demo["gender"]
    age = demo["age"]
    occupation = demo["occupation"]
    genres = movie_genres[movie_id]

    if age not in demographic_ratings["age"]:
        demographic_ratings["age"][age] = []
    demographic_ratings["age"][age].append(rating)

    if gender not in demographic_ratings["gender"]:
        demographic_ratings["gender"][gender] = []
    demographic_ratings["gender"][gender].append(rating)

    if occupation not in demographic_ratings["occupation"]:
        demographic_ratings["occupation"][occupation] = []
    demographic_ratings["occupation"][occupation].append(rating)

    for genre in genres:
        if age not in genre_preferences["age"]:
            genre_preferences["age"][age] = {}
        if genre not in genre_preferences["age"][age]:
            genre_preferences["age"][age][genre] = 0
        genre_preferences["age"][age][genre] += 1

        if gender not in genre_preferences["gender"]:
            genre_preferences["gender"][gender] = {}
        if genre not in genre_preferences["gender"][gender]:
            genre_preferences["gender"][gender][genre] = 0
        genre_preferences["gender"][gender][genre] += 1

        if occupation not in genre_preferences["occupation"]:
            genre_preferences["occupation"][occupation] = {}
        if genre not in genre_preferences["occupation"][occupation]:
            genre_preferences["occupation"][occupation][genre] = 0
        genre_preferences["occupation"][occupation][genre] += 1

print("Average Ratings by Age:")
for age, ratings in demographic_ratings["age"].items():
    print(f"Age {age}: {sum(ratings) / len(ratings):.2f}")

print("\nAverage Ratings by Gender:")
for gender, ratings in demographic_ratings["gender"].items():
    print(f"{gender}: {sum(ratings) / len(ratings):.2f}")

print("\nAverage Ratings by Occupation:")
for occupation, ratings in demographic_ratings["occupation"].items():
    print(f"Occupation {occupation}: {sum(ratings) / len(ratings):.2f}")

print("\nMost Preferred Genres by Age:")
for age, genres in genre_preferences["age"].items():
    top_genre = max(genres.items(), key=lambda x: x[1])
    print(f"Age {age}: {top_genre[0]} ({top_genre[1]} ratings)")

print("\nMost Preferred Genres by Gender:")
for gender, genres in genre_preferences["gender"].items():
    top_genre = max(genres.items(), key=lambda x: x[1])
    print(f"{gender}: {top_genre[0]} ({top_genre[1]} ratings)")

print("\nMost Preferred Genres by Occupation:")
for occupation, genres in genre_preferences["occupation"].items():
    top_genre = max(genres.items(), key=lambda x: x[1])
    print(f"Occupation {occupation}: {top_genre[0]} ({top_genre[1]} ratings)")


Average Ratings by Age:
Age 1: 3.55
Age 56: 3.77
Age 25: 3.55
Age 45: 3.64
Age 50: 3.71
Age 35: 3.62
Age 18: 3.51

Average Ratings by Gender:
F: 3.62
M: 3.57

Average Ratings by Occupation:
Occupation 10: 3.53
Occupation 16: 3.60
Occupation 15: 3.69
Occupation 7: 3.60
Occupation 20: 3.50
Occupation 9: 3.66
Occupation 1: 3.58
Occupation 12: 3.65
Occupation 17: 3.61
Occupation 0: 3.54
Occupation 3: 3.66
Occupation 14: 3.62
Occupation 4: 3.54
Occupation 11: 3.62
Occupation 8: 3.47
Occupation 19: 3.41
Occupation 2: 3.57
Occupation 18: 3.53
Occupation 5: 3.54
Occupation 13: 3.78
Occupation 6: 3.66

Most Preferred Genres by Age:
Age 1: Comedy (11162 ratings)
Age 56: Drama (17269 ratings)
Age 25: Comedy (143210 ratings)
Age 45: Drama (32141 ratings)
Age 50: Drama (29247 ratings)
Age 35: Drama (71590 ratings)
Age 18: Comedy (69980 ratings)

Most Preferred Genres by Gender:
F: Drama (98153 ratings)
M: Comedy (260309 ratings)

Most Preferred Genres by Occupation:
Occupation 10: Comedy (9465 rati
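
The cell above keeps every rating in a list for each age, gender, and occupation before averaging, and prints the groups in the order they were first seen. A lighter alternative, sketched here under the assumption that only the mean is needed, keeps a running sum and count per group and sorts the result. The age labels follow the MovieLens 1M README; `average_by_age` and the sample data are hypothetical.

```python
from collections import defaultdict

# Age codes as listed in the MovieLens 1M README.
AGE_GROUPS = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
              45: "45-49", 50: "50-55", 56: "56+"}

def average_by_age(user_age, ratings):
    """user_age: {user_id: age_code}; ratings: iterable of (user_id, rating) pairs."""
    totals = defaultdict(lambda: [0, 0])  # age code -> [sum of ratings, number of ratings]
    for user_id, rating in ratings:
        age = user_age.get(user_id)
        if age is None:
            continue  # rating from a user with no demographic record
        totals[age][0] += rating
        totals[age][1] += 1
    # Sorted by age code so the groups print in a natural order.
    return {age: s / n for age, (s, n) in sorted(totals.items())}

# Made-up sample: three known users and one rating from an unknown user (id 4).
user_age = {1: 1, 2: 56, 3: 25}
sample_ratings = [(1, 5), (1, 3), (2, 4), (3, 2), (3, 5), (4, 1)]
averages = average_by_age(user_age, sample_ratings)
for age, avg in averages.items():
    print(f"{AGE_GROUPS[age]}: {avg:.2f}")
```

The same pattern works for gender and occupation by swapping the lookup dictionary.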

In [3]:
# Fifth solution - Top performers
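# ISO-8859-1 matches the encoding used in the seventh solution and handles accented titles
movies = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\movies.dat", encoding="ISO-8859-1")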
ratings = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\ratings.dat")

movie_info = {}
movie_ratings = {}

# Parse movies file to get movie information
for line in movies:
    line = line.strip()
    movie_id, title, genres = line.split("::")
    release_year = int(title.strip()[-5:-1]) if title.strip()[-5:-1].isdigit() else None
    movie_info[int(movie_id)] = {"title": title, "year": release_year, "genres": genres.split("|")}

# Parse ratings file to calculate ratings for each movie
for line in ratings:
    line = line.strip()
    user_id, movie_id, rating, timestamp = map(int, line.split("::"))
    
    if movie_id not in movie_ratings:
        movie_ratings[movie_id] = {"total_rating": 0, "rating_count": 0}
    
    movie_ratings[movie_id]["total_rating"] += rating
    movie_ratings[movie_id]["rating_count"] += 1

# Minimum ratings threshold
min_ratings = 50

# Calculate average ratings for movies with enough ratings
movie_avg_ratings = {}
for movie_id, data in movie_ratings.items():
    if data["rating_count"] >= min_ratings:
        avg_rating = data["total_rating"] / data["rating_count"]
        movie_avg_ratings[movie_id] = avg_rating

# Get top-rated movies
top_movies = sorted(movie_avg_ratings.items(), key=lambda x: x[1], reverse=True)[:10]

print("Top 10 Movies by Average Rating:")
for movie_id, avg_rating in top_movies:
    info = movie_info[movie_id]
    # The title already ends with the release year, so the year is not printed again
    print(f"{info['title']}: {avg_rating:.2f}")

# Analyze characteristics of top-rated movies
genre_count = {}
year_count = {}

for movie_id, avg_rating in top_movies:
    info = movie_info[movie_id]
    genres = info["genres"]
    year = info["year"]

    for genre in genres:
        if genre not in genre_count:
            genre_count[genre] = 0
        genre_count[genre] += 1

    if year:
        if year not in year_count:
            year_count[year] = 0
        year_count[year] += 1

print("\nTop Genres Among Top Movies:")
for genre, count in sorted(genre_count.items(), key=lambda x: x[1], reverse=True):
    print(f"{genre}: {count} movies")

print("\nRelease Years of Top Movies:")
for year, count in sorted(year_count.items()):
    print(f"{year}: {count} movies")


Top 10 Movies by Average Rating:
Sanjuro (1962): 4.61
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954): 4.56
Shawshank Redemption, The (1994): 4.55
Godfather, The (1972): 4.52
Close Shave, A (1995): 4.52
Usual Suspects, The (1995): 4.52
Schindler's List (1993): 4.51
Wrong Trousers, The (1993): 4.51
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950): 4.49
Raiders of the Lost Ark (1981): 4.48

Top Genres Among Top Movies:
Action: 4 movies
Drama: 4 movies
Adventure: 2 movies
Crime: 2 movies
Animation: 2 movies
Comedy: 2 movies
Thriller: 2 movies
War: 1 movies
Film-Noir: 1 movies

Release Years of Top Movies:
1950: 1 movies
1954: 1 movies
1962: 1 movies
1972: 1 movies
1981: 1 movies
1993: 2 movies
1994: 1 movies
1995: 2 movies
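
The ranking above depends on the `min_ratings` threshold, which is set to 50 in the cell. To see how the threshold changes the result, the selection can be wrapped in a function, as in this sketch. `top_rated` is a hypothetical helper, the tie-break on rating count is an added assumption rather than part of the original cell, and the sample dictionary is made up in the same shape as `movie_ratings`.

```python
def top_rated(movie_ratings, min_ratings=50, n=10):
    """Return up to n (movie_id, average, count) rows with at least min_ratings ratings."""
    eligible = [
        (movie_id, data["total_rating"] / data["rating_count"], data["rating_count"])
        for movie_id, data in movie_ratings.items()
        if data["rating_count"] >= min_ratings
    ]
    # Highest average first; ties are broken by the larger number of ratings.
    eligible.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return eligible[:n]

# Made-up sample in the same shape as movie_ratings above.
sample = {
    1: {"total_rating": 10, "rating_count": 2},     # average 5.0 from only 2 ratings
    2: {"total_rating": 450, "rating_count": 100},  # average 4.5
    3: {"total_rating": 270, "rating_count": 60},   # average 4.5
    4: {"total_rating": 300, "rating_count": 100},  # average 3.0
}
ranking_loose = [movie_id for movie_id, avg, count in top_rated(sample, min_ratings=1, n=3)]
ranking_strict = [movie_id for movie_id, avg, count in top_rated(sample, min_ratings=50, n=3)]
print(ranking_loose)
print(ranking_strict)
```

With a threshold of 1, the movie rated only twice takes first place; raising the threshold to 50 removes it from the ranking.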


In [5]:
# Sixth solution - Exploring Long Tail
# "Long tail" refers to the large number of products that sell in small quantities,
# as contrasted with the small number of best-selling products.


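# ISO-8859-1 matches the encoding used in the seventh solution and handles accented titles
movies = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\movies.dat", encoding="ISO-8859-1")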
ratings = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\ratings.dat")

movie_info = {}
movie_ratings = {}


for line in movies:
    line = line.strip()
    movie_id, title, genres = line.split("::")
    release_year = int(title.strip()[-5:-1]) if title.strip()[-5:-1].isdigit() else None
    movie_info[int(movie_id)] = {"title": title, "year": release_year, "genres": genres.split("|")}

for line in ratings:
    line = line.strip()
    user_id, movie_id, rating, timestamp = map(int, line.split("::"))
    
    if movie_id not in movie_ratings:
        movie_ratings[movie_id] = {"total_rating": 0, "rating_count": 0}
    
    movie_ratings[movie_id]["total_rating"] += rating
    movie_ratings[movie_id]["rating_count"] += 1

rating_counts = [data["rating_count"] for data in movie_ratings.values()]
threshold_popular = sorted(rating_counts, reverse=True)[int(0.1 * len(rating_counts))]  # Top 10%
threshold_less_rated = sorted(rating_counts)[int(0.1 * len(rating_counts))]  # Bottom 10%

popular_movies = []
less_rated_movies = []

for movie_id, data in movie_ratings.items():
    if data["rating_count"] >= threshold_popular:
        popular_movies.append(movie_id)
    elif data["rating_count"] <= threshold_less_rated:
        less_rated_movies.append(movie_id)


def analyze_group(movie_ids, group_name):
    genre_count = {}
    year_count = {}
    avg_ratings = []

    for movie_id in movie_ids:
        info = movie_info[movie_id]
        genres = info["genres"]
        year = info["year"]
        data = movie_ratings[movie_id]
        avg_rating = data["total_rating"] / data["rating_count"]

        avg_ratings.append(avg_rating)

        for genre in genres:
            if genre not in genre_count:
                genre_count[genre] = 0
            genre_count[genre] += 1

        if year:
            if year not in year_count:
                year_count[year] = 0
            year_count[year] += 1
    print("Long Tail of the dataset")
    print(f"\n{group_name} Movies Analysis:")
    print(f"Average Rating: {sum(avg_ratings) / len(avg_ratings):.2f}")
    print("Top Genres:")
    for genre, count in sorted(genre_count.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {genre}: {count} movies")
    print("Top Release Years:")
    for year, count in sorted(year_count.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {year}: {count} movies")

analyze_group(popular_movies, "Popular")
analyze_group(less_rated_movies, "Less-Rated")

Long Tail of the dataset

Popular Movies Analysis:
Average Rating: 3.78
Top Genres:
  Comedy: 144 movies
  Action: 131 movies
  Drama: 121 movies
  Thriller: 86 movies
  Sci-Fi: 77 movies
Top Release Years:
  1999: 36 movies
  1997: 27 movies
  1998: 25 movies
  1995: 19 movies
  1994: 18 movies
Long Tail of the dataset

Less-Rated Movies Analysis:
Average Rating: 2.86
Top Genres:
  Drama: 204 movies
  Comedy: 94 movies
  Documentary: 34 movies
  Thriller: 29 movies
  Romance: 25 movies
Top Release Years:
  1995: 53 movies
  1998: 48 movies
  1996: 46 movies
  1997: 41 movies
  1994: 30 movies
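
The analysis above compares the top and bottom 10% of movies but does not directly answer how many movies receive very few ratings. A minimal sketch of that count, together with the share of all ratings captured by the most-rated movies, is shown below. The function name `long_tail_summary`, the cut-off of 10 ratings, and the sample counts are assumptions chosen for illustration.

```python
def long_tail_summary(rating_counts, few=10, head_fraction=0.1):
    """rating_counts: {movie_id: number of ratings}. Returns a small summary dict."""
    counts = sorted(rating_counts.values(), reverse=True)
    total = sum(counts)
    head_size = max(1, int(head_fraction * len(counts)))  # at least one movie in the head
    return {
        "movies": len(counts),
        "movies_with_few_ratings": sum(1 for c in counts if c < few),
        "head_share": sum(counts[:head_size]) / total,  # fraction of all ratings in the head
    }

# Made-up rating counts for ten movies.
sample_counts = {1: 500, 2: 300, 3: 100, 4: 50, 5: 20, 6: 9, 7: 8, 8: 7, 9: 4, 10: 2}
summary = long_tail_summary(sample_counts)
print(summary)
```

With the real data, `rating_counts` could be built as `{movie_id: data["rating_count"] for movie_id, data in movie_ratings.items()}`. Any movie listed in `movies.dat` that has no ratings at all never enters `movie_ratings`, so it would have to be added with a count of zero to be included in the tail.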


In [7]:
# Seventh solution - Tag analysis
# No tags file is read here, so genre labels stand in for tags
movies = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\movies.dat", encoding="ISO-8859-1")
ratings = open(r"C:\Users\nimis\Downloads\ml-1m\ml-1m\ratings.dat", encoding="ISO-8859-1")

movie_info = {}
genre_stats = {}

for line in movies:
    line = line.strip()
    movie_id, title, genres = line.split("::")
    genres = genres.split("|")
    movie_info[int(movie_id)] = {"title": title, "genres": genres}

    for genre in genres:
        if genre not in genre_stats:
            genre_stats[genre] = {"total_rating": 0, "rating_count": 0}

for line in ratings:
    line = line.strip()
    user_id, movie_id, rating, timestamp = map(int, line.split("::"))

    if movie_id in movie_info:
        genres = movie_info[movie_id]["genres"]
        for genre in genres:
            genre_stats[genre]["total_rating"] += rating
            genre_stats[genre]["rating_count"] += 1

print("Genre Statistics:")
for genre, stats in genre_stats.items():
    avg_rating = stats["total_rating"] / stats["rating_count"] if stats["rating_count"] > 0 else 0
    print(f"{genre}: {stats['rating_count']} ratings, Avg Rating: {avg_rating:.2f}")


Genre Statistics:
Animation: 43293 ratings, Avg Rating: 3.68
Children's: 72186 ratings, Avg Rating: 3.42
Comedy: 356580 ratings, Avg Rating: 3.52
Adventure: 133953 ratings, Avg Rating: 3.48
Fantasy: 36301 ratings, Avg Rating: 3.45
Romance: 147523 ratings, Avg Rating: 3.61
Drama: 354529 ratings, Avg Rating: 3.77
Action: 257457 ratings, Avg Rating: 3.49
Crime: 79541 ratings, Avg Rating: 3.71
Thriller: 189680 ratings, Avg Rating: 3.57
Horror: 76386 ratings, Avg Rating: 3.22
Sci-Fi: 157294 ratings, Avg Rating: 3.47
Documentary: 7910 ratings, Avg Rating: 3.93
War: 68527 ratings, Avg Rating: 3.89
Musical: 41533 ratings, Avg Rating: 3.67
Mystery: 40178 ratings, Avg Rating: 3.67
Film-Noir: 18261 ratings, Avg Rating: 4.08
Western: 20683 ratings, Avg Rating: 3.64
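
The cell above reads only `movies.dat` and `ratings.dat`, so it reports genre statistics rather than tag frequencies. If a tags file were available, the most frequently used tags could be counted along the lines of the sketch below. It assumes a `UserID::MovieID::Tag::Timestamp` layout, as in the `tags.dat` file of the larger MovieLens 10M release; the helper name `top_tags` and the sample lines are made up.

```python
from collections import Counter
import io

def top_tags(tag_lines, n=5):
    """Return the n most common tags, compared case-insensitively."""
    counts = Counter()
    for line in tag_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, tag, timestamp = line.split("::")
        counts[tag.strip().lower()] += 1
    return counts.most_common(n)

# Made-up sample in the assumed tags.dat layout (UserID::MovieID::Tag::Timestamp).
sample_tags = io.StringIO(
    "15::100::Sci-Fi::1193435061\n"
    "20::100::space::1193435062\n"
    "20::200::sci-fi::1193435063\n"
    "21::300::funny::1193435064\n"
    "22::300::Funny::1193435065\n"
    "23::300::sci-fi::1193435066\n"
)
most_common_tags = top_tags(sample_tags, n=2)
print(most_common_tags)
```

To check whether tags are consistent with genres, each lowercased tag could then be compared against the lowercased genre list of the same movie from `movies.dat`.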
