# Generating additional data and transforming MovieLens data into a semi-structured form

The data was generated and transformed because the chosen dataset was too simple for the scope of the course project.

The original dataset consists of three separate CSV files (movies, ratings and links). The data from these files was combined into a semi-structured format (JSON) and partially transformed into free-form text. In addition, user information was generated randomly to make the dataset more valuable for the project.

In [None]:
import pandas as pd
import json
import random

In [None]:
# Load CSV files
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
links = pd.read_csv('links.csv')

## Generating user information

For each user, a nationality is picked at random from a fixed list of nationalities, a gender from M/F, and an age from the range 15-60.

In [None]:
# generate user info
nationalities = ["Estonian", "American", "Hungarian", "British", "Italian", "Singaporean", "Moroccan", "Danish", "Canadian", "Netherlander", "Swedish", "Swiss"]

age_range = (15, 60)  # random.randint includes both endpoints
genders = ['M', 'F']

user_df = []

# User IDs 1..610 are hard-coded; assumed to match the userId values in ratings.csv
for i in range(1, 611):
  nationality = random.choice(nationalities)
  gender = random.choice(genders)
  age = random.randint(*age_range)

  user_df.append({
      "user_id": i,
      "nationality": nationality,
      "gender": gender,
      "age": age
  })

user_df = pd.DataFrame(user_df)
user_df.to_csv('users.csv', index=False)
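
Because the values are drawn with `random`, every run produces a different `users.csv`. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for the project. The function name `generate_users`, the seed value and the shortened nationality list are illustrative and not part of the original notebook.

```python
import random

import pandas as pd


def generate_users(user_ids, nationalities, genders=("M", "F"),
                   age_range=(15, 60), seed=42):
    """Return a DataFrame with one randomly generated profile per user ID."""
    # Private generator: results are repeatable and the global random state is untouched
    rng = random.Random(seed)
    rows = []
    for user_id in user_ids:
        rows.append({
            "user_id": user_id,
            "nationality": rng.choice(nationalities),
            "gender": rng.choice(genders),
            # randint includes both endpoints, so ages fall in 15..60
            "age": rng.randint(*age_range),
        })
    return pd.DataFrame(rows)


# Hypothetical usage with a shortened nationality list
users = generate_users(range(1, 611), ["Estonian", "American", "Hungarian"])
```

With the ratings loaded, `sorted(ratings['userId'].unique())` could be passed as `user_ids` instead of the hard-coded range, so the generated users always match the ones that actually appear in the data.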

In [None]:
user_df.head()

Unnamed: 0,user_id,nationality,gender,age
0,1,Estonian,F,19
1,2,British,M,28
2,3,Moroccan,M,31
3,4,American,M,55
4,5,Swedish,M,34


## Transforming the MovieLens data

A JSON object is created for each movie in the dataset. Each object contains the movie's title, details (genres), ratings in a free-form format similar to a log, and external links (IMDb and TMDB IDs) as a string.

In [None]:
# Transform ratings into a "raw" form
def create_related_data(movie_id):
    movie_ratings = ratings[ratings['movieId'] == movie_id]

    ratings_log = []
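    for _, row in movie_ratings.iterrows():
        # iterrows() yields each row as a single float Series (rating is a float column),
        # so userId and timestamp are rendered as e.g. 1.0 and 964982703.0
        ratings_log.append(f"Rated {row['rating']} by user {row['userId']} at {row['timestamp']}")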

    return ratings_log

# Create structure
raw_data = []

for _, movie in movies.iterrows():
    movie_id = movie['movieId']
    related_data = create_related_data(movie_id)
    # Select the links row(s) for this movie once; assumes at least one match exists
    movie_links = links[links['movieId'] == movie_id]
    external_links = f"IMDB ID: {movie_links['imdbId'].values[0]}, " \
                     f"TMDB ID: {movie_links['tmdbId'].values[0]}"

    raw_data.append({
        "movie": movie['title'],
        "details": f"Genres: {', '.join(movie['genres'].split('|')) if pd.notnull(movie['genres']) else 'N/A'}",
        "ratings": related_data,
        "external_links": external_links
    })

# Save to JSON file
with open('raw_unstructured_data.json', 'w') as f:
    json.dump(raw_data, f, indent=4)
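
Since the ratings are stored as free-form log lines, any later processing step has to parse them back into fields. The following is a minimal sketch of such a parser, assuming every line follows the `Rated <rating> by user <id> at <timestamp>` pattern produced above. The names `RATING_LINE` and `parse_rating_line` are hypothetical and not part of the original notebook.

```python
import re

# Pattern for lines such as "Rated 4.0 by user 1.0 at 964982703.0"
RATING_LINE = re.compile(
    r"Rated (?P<rating>\d+(?:\.\d+)?) "
    r"by user (?P<user>\d+(?:\.\d+)?) "
    r"at (?P<ts>\d+(?:\.\d+)?)"
)


def parse_rating_line(line):
    """Convert one log line back into a dict, or return None if it does not match."""
    match = RATING_LINE.fullmatch(line)
    if match is None:
        return None
    return {
        "rating": float(match["rating"]),
        # IDs and timestamps were written as floats ("1.0"), so convert via float first
        "user_id": int(float(match["user"])),
        "timestamp": int(float(match["ts"])),
    }


parsed = parse_rating_line("Rated 4.0 by user 1.0 at 964982703.0")
```

Returning `None` for non-matching lines, rather than raising, lets a caller count or log malformed entries without aborting the whole pass.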

In [None]:
raw_data[0]

{'movie': 'Toy Story (1995)',
 'details': 'Genres: Adventure, Animation, Children, Comedy, Fantasy',
 'ratings': ['Rated 4.0 by user 1.0 at 964982703.0',
  'Rated 4.0 by user 5.0 at 847434962.0',
  'Rated 4.5 by user 7.0 at 1106635946.0',
  'Rated 2.5 by user 15.0 at 1510577970.0',
  'Rated 4.5 by user 17.0 at 1305696483.0',
  'Rated 3.5 by user 18.0 at 1455209816.0',
  'Rated 4.0 by user 19.0 at 965705637.0',
  'Rated 3.5 by user 21.0 at 1407618878.0',
  'Rated 3.0 by user 27.0 at 962685262.0',
  'Rated 5.0 by user 31.0 at 850466616.0',
  'Rated 3.0 by user 32.0 at 856736119.0',
  'Rated 3.0 by user 33.0 at 939647444.0',
  'Rated 5.0 by user 40.0 at 832058959.0',
  'Rated 5.0 by user 43.0 at 848993983.0',
  'Rated 3.0 by user 44.0 at 869251860.0',
  'Rated 4.0 by user 45.0 at 951170182.0',
  'Rated 5.0 by user 46.0 at 834787906.0',
  'Rated 3.0 by user 50.0 at 1514238116.0',
  'Rated 3.0 by user 54.0 at 830247330.0',
  'Rated 5.0 by user 57.0 at 965796031.0',
  'Rated 5.0 by user 63.0