# Homework 4 - Recommendation systems and clustering everywhere

Behavioral user data is a valuable resource for understanding audience patterns on Netflix, particularly in the context of UK movies. It offers insights into how viewers interact with the popular streaming platform, allowing researchers and data enthusiasts to explore trends, preferences, and patterns in user engagement with Netflix content. Whether you're interested in analyzing viewing habits, content popularity, or user demographics, this information provides a rich source to gain a deeper understanding of Netflix audience behavior in the United Kingdom.

Now, you and your team have been hired by Netflix to get to know their users. In other words, you will implement hashing and clustering techniques to extract relevant information and highlights from those users and their behavior inside the platform.

Then, let's get started!

## 1. Recommendation system


Implementing a recommendation system is critical for businesses and digital platforms that want to thrive in today's competitive environment. These systems use data-driven personalization to tailor content, products, and services to individual user preferences. The latter improves user engagement, satisfaction, retention, and revenue through increased sales and cross-selling opportunities. In this section, you will attempt to implement a recommendation system by identifying similar users' preferences and recommending movies they watch to the study user.

To be more specific, you will implement your version of the LSH algorithm, which will take as input the user's preferred genre of movies, find the most similar users to this user, and recommend the most watched movies by those who are more similar to the user.

Data: The data you will be working with can be found here.

Looking at the data, you can see that there is data available for each user for the movies the user clicked on. Gather the title and genre of the maximum top 10 movies that each user clicked on regarding the number of clicks.

In [47]:
import numpy as np
import pandas as pd
from collections import Counter

In [48]:
df = pd.read_csv("vodclickstream_uk_movies_03.csv")

In [49]:
df.head()

Unnamed: 0.1,Unnamed: 0,datetime,duration,title,genres,release_date,movie_id,user_id
0,58773,2017-01-01 01:15:09,0.0,"Angus, Thongs and Perfect Snogging","Comedy, Drama, Romance",2008-07-25,26bd5987e8,1dea19f6fe
1,58774,2017-01-01 13:56:02,0.0,The Curse of Sleeping Beauty,"Fantasy, Horror, Mystery, Thriller",2016-06-02,f26ed2675e,544dcbc510
2,58775,2017-01-01 15:17:47,10530.0,London Has Fallen,"Action, Thriller",2016-03-04,f77e500e7a,7cbcc791bf
3,58776,2017-01-01 16:04:13,49.0,Vendetta,"Action, Drama",2015-06-12,c74aec7673,ebf43c36b6
4,58777,2017-01-01 19:16:37,0.0,The SpongeBob SquarePants Movie,"Animation, Action, Adventure, Comedy, Family, ...",2004-11-19,a80d6fc2aa,a57c992287


In [50]:
df.shape

(671736, 8)

## datetime

In [51]:
len(df[df["datetime"].isna()])

0

In [52]:
min(df["datetime"].unique())

'2017-01-01 00:02:21'

In [53]:
max(df["datetime"].unique())

'2019-06-30 23:59:20'

## duration

In [54]:
len(df[df["duration"].isna()])

0

In [55]:
min(df["duration"].unique())

-1.0

In [56]:
max(df["duration"].unique())

18237253.0

In [57]:
df = df[df["duration"] >= 0]

## title

In [58]:
len(df[df["title"].isna()])

0

In [59]:
len(df[df["title"]==" "])

0

In [60]:
len(df["title"].unique())

7874

## genres

In [61]:
len(df[df["genres"].isna()])

0

In [62]:
len(df[df["genres"]==" "])

0

In [63]:
df = df[df["genres"] != "NOT AVAILABLE"]

## release_date

In [64]:
len(df[df["release_date"].isna()])

0

In [65]:
min(df["release_date"].unique())

'1920-10-01'

In [66]:
max(df["release_date"].unique())

'NOT AVAILABLE'

In [67]:
#Get only the data since Netflix was created
df = df[df["release_date"] >= "2007-01-16"]
df = df[df["release_date"] != "NOT AVAILABLE"]

## movie_id

In [68]:
len(df[df["movie_id"].isna()])

0

In [69]:
len(df["movie_id"].unique())

5442

## user_id

In [70]:
len(df[df["user_id"].isna()])

0

In [71]:
len(df["user_id"].unique())

137665

In [72]:
df.head()

Unnamed: 0.1,Unnamed: 0,datetime,duration,title,genres,release_date,movie_id,user_id
0,58773,2017-01-01 01:15:09,0.0,"Angus, Thongs and Perfect Snogging","Comedy, Drama, Romance",2008-07-25,26bd5987e8,1dea19f6fe
1,58774,2017-01-01 13:56:02,0.0,The Curse of Sleeping Beauty,"Fantasy, Horror, Mystery, Thriller",2016-06-02,f26ed2675e,544dcbc510
2,58775,2017-01-01 15:17:47,10530.0,London Has Fallen,"Action, Thriller",2016-03-04,f77e500e7a,7cbcc791bf
3,58776,2017-01-01 16:04:13,49.0,Vendetta,"Action, Drama",2015-06-12,c74aec7673,ebf43c36b6
5,58778,2017-01-01 19:21:37,0.0,London Has Fallen,"Action, Thriller",2016-03-04,f77e500e7a,c5bf4f3f57


### 1.2 Minhash Signatures


Using the movie genre and user_ids, try to implement your min-hash signatures so that users with similar interests in a genre appear in the same bucket.

Important note: You must write your minhash function from scratch. You are not permitted to use any already implemented hash functions. Read the class materials and, if necessary, conduct an internet search. The description of hash functions in the book may be helpful as a reference.

In [73]:
#Get for every user the top10 movies.
distinct_movie_genre = df[['movie_id', 'genres', "title"]].drop_duplicates()
clicks = df.groupby(['user_id', 'movie_id']).size().reset_index(name='Number_of_clicks')
movie_genre = clicks.sort_values(by='Number_of_clicks', ascending=False).groupby('user_id').head(10).reset_index(drop=True)
movie_genre = pd.merge(movie_genre, distinct_movie_genre, on = "movie_id", how = "left")
movie_genre["genres"] = movie_genre["genres"].str.split(", ")

In [74]:
movie_genre.head()

Unnamed: 0,user_id,movie_id,Number_of_clicks,genres,title
0,7cdfd0e14a,40bccd3001,88,"[Drama, Fantasy, Romance]",Twilight
1,e06f0be797,3f3b34e56f,54,"[Action, Comedy, Crime, Thriller]",Rush Hour 3
2,59416738c3,cbdf9820bc,54,"[Comedy, Romance]",The Ex
3,49d091aa63,b8a2658c23,48,"[Comedy, Romance, Sport]",Chalet Girl
4,3675d9ba4a,948f2b5bf6,42,"[Drama, Romance, Sci-Fi, Thriller]",Passengers


In [75]:
#Generate  slist of uniques genres
list_of_genres = list(movie_genre["genres"])
unique_genres = set(genre for genres in list_of_genres for genre in genres)
unique_genres_list = list(unique_genres)
genre_dict = {genre: i for i, genre in enumerate(sorted(unique_genres_list))}

In [76]:
#Genrate a list of unique users
users = movie_genre["user_id"].unique()
users_dict = {user: i for i, user in enumerate(sorted(users))}
inverted_users_dict = {value: key for key, value in users_dict.items()}

In [77]:
def matrix_representation_users_genres(users_dict, genre_dict, movie_genre):
    """
    Return the matrix representation, with cols as users and rows as genres. 1 will mean that a user has in common this genre, 0 otherwise.
    """
    df_shape = movie_genre.shape[0]
    rows = len(genre_dict)
    cols = len(users_dict)
    matrix_representation = np.zeros((rows, cols), dtype = int)
    
    for i in range(df_shape):    
        user = movie_genre.iloc[i][0]
        genres = movie_genre.iloc[i][3]
        for genre in genres:
            matrix_representation[genre_dict[genre], users_dict[user]] = 1
    return matrix_representation

In [78]:
def signature_matrix_minhash(n_hashes, hash_function, matrix_representation):
    """
    Compute minhash signature matrix with n_hashes hash functions genrated randomly.
    """
    np.random.seed(41)
    cols = len(matrix_representation[0])
    signature_matrix = np.full((n_hashes, cols), np.inf)
    a_b = [(round(np.random.uniform(1, 999)), round(np.random.uniform(1, 999))) for _ in range(n_hashes)]
    for r in range(len(matrix_representation)):

        hashes = [hash_function(a_b[i][0], r, a_b[i][1]) for i in range(n_hashes)]

        cols_with_one = list(np.nonzero(matrix_representation[r])[0])

        for col in cols_with_one:
            for h in range(n_hashes):
                if signature_matrix[h, col] > hashes[h]:
                    signature_matrix[h, col] = hashes[h]
    return signature_matrix

In [79]:
matrix_representation = matrix_representation_users_genres(users_dict, genre_dict, movie_genre)
n_hashes = 20
hash_function = lambda a, x, b : (a * x + b) % 31
signature_matrix = signature_matrix_minhash(n_hashes, hash_function, matrix_representation)

In [80]:
signature_matrix

array([[ 0.,  8.,  6., ...,  6.,  0.,  3.],
       [ 1.,  5.,  2., ...,  2.,  2.,  5.],
       [ 0., 10., 10., ..., 13.,  4.,  4.],
       ...,
       [ 1., 10., 10., ..., 10.,  0.,  2.],
       [ 0.,  2.,  2., ...,  2.,  0.,  4.],
       [ 0.,  0.,  0., ...,  0.,  3.,  0.]])

### 1.3 Locality-Sensitive Hashing (LSH)


Now that your buckets are ready, it's time to ask a few queries. We will provide you with some user_ids and ask you to recommend at most five movies to the user to watch based on the movies clicked by similar users.

To recommend at most five movies given a user_id, use the following procedure:

1. Identify the two most similar users to this user.
2. If these two users have any movies in common, recommend those movies based on the total number of clicks by these users.
3. If there are no more common movies, try to propose the most clicked movies by the most similar user first, followed by the other user.

Note: At the end of the process, we expect to see at most five movies recommended to the user.

Example: assume you've identified user A and B as the most similar users to a single user, and we have the following records on these users:

- User A with 80% similarity
- User B with 50% similarity

| user | movie title              | #clicks |
|------|--------------------------|---------|
| A    | Wild Child               | 20      |
| A    | Innocence                | 10      |
| A    | Coin Heist               | 2       |
| B    | Innocence                | 30      |
| B    | Coin Heist               | 15      |
| B    | Before I Fall            | 30      |
| B    | Beyond Skyline           | 8       |
| B    | The Amazing Spider-Man   | 5       |

- **Recommended Movies in Order:**
   - Innocence
   - Coin Heist
   - Wild Child
   - Before I Fall
   - Beyond Skyline


In [81]:
def hashing_function(bucket):
    """
    Hashing of a tuple. Concatenate all the hasgings of the elements of the tuple.
    """
    hashing = ""
    np.random.seed(41)
    
    for elm in bucket:
        hashing += str((round(np.random.uniform(1, 999))) * int(elm)+ round(np.random.uniform(1, 999)) % 997)
    return int(hashing)

In [82]:
def lsh(signature_matrix, rows, inverted_dict):
    """
    Compute lsh algorithm, and return a dictionary of all buckets as key and users as values.
    """
    buckets = {}
    signature_matrix = signature_matrix.T
    for index, row in enumerate(signature_matrix):
        for n in range(0, len(row),rows):
            band = row[n:n+rows]
            hashed_value = hashing_function(band)
            if hashed_value in buckets:
                buckets[hashed_value].append(inverted_dict[index])
            else:
                buckets[hashed_value] = [inverted_dict[index]]
    return buckets

In [83]:
def most_common_user(user_id, bucket):
    """
    Return the 2 most common users to a given user_id.
    """
    buckets_user = []
    users = []
    for bucket in buckets.values():
        if user_id in bucket:
            buckets_user.append(bucket)
    for bucket in buckets_user:
        users.append(bucket)
    users = list(np.concatenate(users))
    counts = Counter(users).most_common()
    return (counts[0][0], counts[1][0])

In [84]:
def get_films(common_users, movie_genre):
    """
    Recommend the most similar films to the given users following the instructions of statment.
    """
    user1 = common_users[0]
    user2 = common_users[1]
    final_df = pd.DataFrame(columns=["movie_id", "user_id", "title", "Number_of_clicks"])
    films_to_show = 5

    # 1. Movies in common based on number_clicks
    df_movies1 = movie_genre[(movie_genre["user_id"] == user1) & (movie_genre["user_id"] == user2)]
    df_movies1 = df_movies1.groupby("movie_id")["Number_of_clicks"].sum().reset_index()
    df_movies1 = df_movies1.sort_values(by="Number_of_clicks", ascending=False)
    if not df_movies1.empty:
        final_df = pd.concat([final_df, df_movies1[["movie_id", "user_id", "title", "Number_of_clicks"]]])

    # 2. Most clicked movies by the first user
    df_movies2 = movie_genre[movie_genre["user_id"] == user1].sort_values(by="Number_of_clicks", ascending=False)
    if not df_movies2.empty:
        final_df = pd.concat([final_df, df_movies2[["movie_id", "user_id", "title", "Number_of_clicks"]]])

    # 3. Most clicked movies by the second user
    df_movies3 = movie_genre[movie_genre["user_id"] == user2].sort_values(by="Number_of_clicks", ascending=False)
    if not df_movies3.empty:
        final_df = pd.concat([final_df, df_movies3[["movie_id", "user_id", "title", "Number_of_clicks"]]])

    return final_df.drop_duplicates()

In [85]:
buckets = lsh(signature_matrix, 4, inverted_users_dict)
user_id = "49d091aa63"
mc_users = most_common_user(user_id, buckets)
df_films = get_films(mc_users, movie_genre)
df_films.head(5)[["movie_id", "title"]]

Unnamed: 0,movie_id,title
253803,117c9dc515,Set It Up
253808,7b3d8d5976,Bring It On: Fight to the Finish
253215,f80b7002bb,Anchorman: The Legend Continues
253216,771f79dd7e,The Love Guru


## 2. Grouping Users together!


Now, we will deal with clustering algorithms that will provide groups of Netflix users that are similar among them.

To solve this task, you must accomplish the following stages:

### 2.1 Getting your data + feature engineering

1. Access to the data found in this dataset

2. Sometimes, the features (variables, fields) are not given in a dataset but can be created from it; this is known as feature engineering. For example, the original dataset has several clicks done by the same user, so grouping data by user_id will allow you to create new features for each user:

a) Favorite genre (i.e., the genre on which the user spent the most time)

b) Average click duration

c) Time of the day (Morning/Afternoon/Night) when the user spends the most time on the platform (the time spent is tracked through the duration of the clicks)

d) Is the user an old movie lover, or is he into more recent stuff (content released after 2010)?

e) Average time spent a day by the user (considering only the days he logs in)

So, in the end, you should have for each user_id five features.

3. Consider at least 10 additional features that can be generated for each user_id (you can use chatGPT or other LLM tools for suggesting features to create). Describe each of them and add them to the previous dataset you made (the one with five features). In the end, you should have for each user at least 15 features (5 recommended + 10 suggested by you).

In [40]:
#CODE

### 2.2 Choose your features (variables)!


You may notice that you have plenty of features to work with now. So, it would be best to find a way to reduce the dimensionality (reduce the number of variables to work with). You can follow the subsequent directions to achieve it:

1. To normalise or not to normalise? That's the question. Sometimes, it is worth normalizing (scaling) the features. Explain if it is a good idea to perform any normalization method. If you think the normalization should be used, apply it to your data (look at the available normalization functions in the scikit-learn library).

2. Select one method for dimensionality reduction and apply it to your data. Some suggestions are Principal Component Analysis, Multiple Correspondence Analysis, Singular Value Decomposition, Factor Analysis for Mixed Data, Two-Steps clustering. Make sure that the method you choose applies to the features you have or modify your data to be able to use it. Explain why you chose that method and the limitations it may have.

In [41]:
#CODE

### Clustering!

1. Implement the K-means clustering algorithm (not ++: random initialization) using MapReduce. We ask you to write the algorithm from scratch following what you learned in class.

2. Find an optimal number of clusters. Use at least two different methods. If your algorithms provide diverse optimal K's, select one of them and explain why you chose it.

3. Run the algorithm on the data obtained from the dimensionality reduction.

4. Implement K-means++ from scratch and explain the differences with the results you got earlier.

5. Ask ChatGPT to recommend other clustering algorithms and choose one. Explain your choice, then ask ChatGPT to implement it or use already implemented versions (e.g., the one provided in the scikit-learn library) and run it on your data. Explain the differences (if there are any) in the results. Which one is the best, in your opinion, and why?

In [42]:
#CODE

### 2.4 Analysing your results! --


You are often encouraged to explain the main characteristics that your clusters have. The latter is called the Characterizing Clusters step. Thus, follow the next steps to do it:

1. Select 2-3 variables you think are relevant to identify the cluster of the customer. For example, Time_Day, Average Click Duration, etc.

2. Most of your selected variables will be numerical (continuous or discrete), then categorize them into four categories.

3. With the selected variables, perform pivot tables. On the horizontal axis, you will have the clusters, and on the vertical axis, you will have the categories of each variable. Notice that you have to do one pivot table per variable.

4. Calculate the percentage by column for each pivot table. The sum of each row (cluster) must be 100. The sample example for clustering with K = 4 and Time_Day variable:

| Time_Day | Afternoon | Morning | Night |
|----------|-----------|---------|-------|
| Cluster|          |       |      |
| 1| 3         | 94      | 3     |
| 2| 83        | 5       | 12    |
| 3| 16        | 10      | 74    |
| 4| 34        | 18      | 48    |


5. Interpret the results for each pivot table.

6. Use any known metrics to estimate clustering algorithm performance (how good are the clusters you found?). Comment on the results obtained.

## 3. Bonus Question

We remind you that we consider and grade the bonuses only if you complete the entire assignment.

Density-based clustering identifies clusters as regions in the data space with high point density that are separated from other clusters by regions of low point density. The data points in the separating regions of low point density are typically considered noise or outliers. Typical algorithms that fall into this category are OPTICS and DBSCAN.

1. Ask ChatGPT (or any other LLM tool) to list three algorithms for Density-Based Clustering. Choose one and use it on the same dataset you used in 2.3. Analyze your results: how different are they from the centroid-based version?

Note: You can implement your algorithm from scratch or use the one implemented in the scikit-learn library; the choice is up to you!

In [43]:
#CODE

## 4. Command Line Question (CLQ)


Here is another command line question to enjoy. We previously stated that using the command line tools is a skill that Data Scientists must master.

In this question, you should use any command line tool that you know to answer the following questions using the same dataset that you have been using so far:

- What is the most-watched Netflix title?
- Report the average time between subsequent clicks on Netflix.com
- Provide the ID of the user that has spent the most time on Netflix

Important note: You may work on this question in any environment (AWS, your PC command line, Jupyter notebook, etc.), but the final script must be placed in CommandLine.sh, which must be executable. Please run the script and include a screenshot of the output in the notebook for evaluation.



In [44]:
#CODE

## 5. Algorithmic Question (AQ)

Federico studies in a demanding university where he has to take a certain number N of exams to graduate, but he is free to choose in which order he will take these exams. Federico is panicking since this university is not only one of the toughest in the world but also one of the weirdest. His final grade won't depend at all on the mark he gets in these courses: there's a precise evaluation system.

He was given an initial personal score of S when he enrolled, which changes every time he takes an exam: now comes the crazy part. He soon discovered that every of the N exams he has to take is assigned a mark p. Once he has chosen an exam, his score becomes equal to the mark p, and at the same time, the scoring system changes:

- If he takes an "easy" exam (the score of the exam being less than his score), every other exam's mark is increased by the quantity S - p

- If he takes a "hard" exam (the score of the exam is greater than his score), every other exam's mark is decreased by the quantity p - S

So, for example, consider S = 8 as the initial personal score. Federico must decide which exam he wants to take, being [5, 7, 1] the marks list. If he takes the first one, being 5 < 8 and 8 - 5 = 3, the remaining list now becomes [10, 4], and his score is updated as S = .

In this chaotic university where the only real exam seems to be choosing the best way to take exams, you are the poor student advisor who is facing a long queue of confused people who need some help. Federico is next in line, and he comes up in turn with an inescapable question: he wants to know which is the highest score possible he could get.

a) Fortunately, you have a computer app designed by a brilliant student. Federico wants you to show him the code which this app is based on because he wants to do paid counseling for other desperate students: in a recursive fashion, the helped helps the helpable.

In [1]:
from itertools import zip_longest
#initial personal score
S=int(input())
# marks
exams_grades=list(map(int,input().split()))
exams_grades.sort()
#we create 2 list where we put the greatest and smallest numbers
l1=exams_grades[:len(exams_grades)//2]
l2=exams_grades[len(exams_grades)//2:]
#we create a new list with an element from the list of the greatest numbers and an element from the list of the smallest numbers
if len(l2)==len(l1):
  new_list = [elemento for val in zip_longest(l1, l2) for elemento in val]
else:
  new_list = [elemento for val in zip_longest(l2, l1) for elemento in val if elemento is not None]

# we update the marks according to the rule given by the university
for x in range(len(new_list)):
  grade=new_list[x]
  diff=S-grade
  for y in range(x+1,len(new_list)):
      new_list[y]+=diff
  S=grade
print(S)

30
13 27 41 59 28 33 39 19 52 48 55 79
205


b) Federico is getting angry because he claims that your code is slow! Show him formally with a big-O notation that he is as crazy as this university!

### Time Complexity Analysis


**Input Reading**:
Reading
S and
exams_grades
exams_grades takes constant time, so the complexity is O(1).

**List Sorting**: Sorting the list
exams_grades has a time complexity of
O(nlogn), where
n is the length of exams_grades.

**Creating new_list** : The creation of new_list is performed once and has a linear time complexity relative to the length of the list. Therefore, it is
O(n).

**Updating Marks**: The time complexity analysis for the updating marks part is as follows:

- The outer loop runs N times.
- The inner loop also runs, on average, N/2 times across all iterations of the outer loop.
- Inside the inner loop, updating the elements of `new_list` takes constant time, O(1).
- Therefore, the total time complexity for this part is O(N^2).

**In conclusion** : Considering both factors, the overall time complexity of the provided code is dominated by the quadratic term, resulting in a final time complexity of O(N^2) due to the updating marks part.


c) If, unfortunately, Federico is right in the grip of madness, he will threaten you to optimize the code through a different approach. You should end this theater of the absurd by any means! (And again, formally prove that you improved time complexity)

In [2]:
def best_grade_improved(S=int(input()),exams_grades=list(map(int, input().split()))):
    exams_grades.sort()
    n = len(exams_grades)
    l1 = exams_grades[:n//2]
    l2 = exams_grades[n//2:]
# if the number of the exams is even or odd we have to create the list in a different way
    if n % 2 == 0:
        new_list = [elemento for val in zip_longest(l1, l2) for elemento in val]
    else:
        new_list = [elemento for val in zip_longest(l2, l1) for elemento in val if elemento is not None]
    # after have put the marks in a proper way we call the recursive function to find the best final grade
    adjust_grades_recursive(S, new_list)


def adjust_grades_recursive(S, exams_grades_systemed):
    if not exams_grades_systemed:
        print(S)
        return
    grade = exams_grades_systemed[0]
    diff=S-grade
    # we update the marks according to the rule given by the university
    for y in range(1, len(exams_grades_systemed)):
      exams_grades_systemed[y] += diff
    S = grade
    adjust_grades_recursive(S, exams_grades_systemed[1:])


best_grade_improved()

30
13 27 41 59 28 33 39 19 52 48 55 79
205


### Time Complexity Analysis

1. **Input Reading and List Sorting:**
   - Reading the personal score `S` and creating the `exams_grades` list take O(N) time.
   - Sorting the `exams_grades` list takes O(N log N) time.

2. **Creating Lists `l1` and `l2`:**
   - Creating `l1` and `l2` by slicing the sorted list takes O(N) time.

3. **Creating `new_list`:**
   - Creating `new_list` using `zip_longest` and list comprehensions takes O(N) time.

4. **Recursive Grade Adjustment:**
   - The recursive function `adjust_grades_recursive` processes each element of `exams_grades_systemed` once.
   - In each iteration, the function performs constant-time operations (O(1)).
   - In the worst case, the recursion depth is N.
   - Therefore, the total time complexity for this part is O(N).

Considering the sorting operation, the overall time complexity of the improved code is O(N log N) due to the dominant sorting step.

So this is more efficient because to update marks has a complexity of O(N) and not O(N^2) as in the former version.

d) Ask chatGPT for a third (optimized) implementation and analyze again its time complexity. Be careful (and crafty) in defining the prompt, and challenge the machine in this coding question!

In [4]:
from statistics import median
def best_grade_improved_chatgpt(S=int(input()), exams_grades=list(map(int, input().split()))):
  # we use the median to split the exams_grades in two list one with the largest numbers and the other one with the smallest number
    median_value = median(exams_grades)

    l1 = [element for element in exams_grades if element < median_value]

    l2 = [element for element in exams_grades if element >= median_value]

    if len(l2)==len(l1):
        new_list = [elemento for val in zip_longest(l1, l2) for elemento in val]
    else:
        new_list = [elemento for val in zip_longest(l2, l1) for elemento in val if elemento is not None]

    adjust_grades_recursive(S, new_list)

def adjust_grades_recursive(S, exams_grades_systemed):
    if not exams_grades_systemed:
        print(S)
        return
    grade = exams_grades_systemed[0]
    diff = S - grade
    for y in range(1, len(exams_grades_systemed)):
        exams_grades_systemed[y] += diff
    S = grade
    adjust_grades_recursive(S, exams_grades_systemed[1:])

best_grade_improved_chatgpt()

30
13 27 41 59 28 33 39 19 52 48 55 79
205


## Time Complexity Analysis

### Input Reading and Median Computation:

- Reading the personal score `S` and creating the `exams_grades` list take O(N) time.
- Computing the median using the `statistics.median` function has an expected time complexity of O(N) but in the worst case is O(N log N) so we have an improvement only on average.

### Creating Lists `l1` and `l2`:

- Creating `l1` and `l2` using list comprehensions takes O(N) time.

### Creating `new_list`:

- Creating `new_list` using `zip_longest` and list comprehensions takes O(N) time.

### Recursive Grade Adjustment:

- The recursive function `adjust_grades_recursive` processes each element of `exams_grades_systemed` once.
- In each iteration, the function performs constant-time operations (O(1)).
- In the worst case, the recursion depth is N.
- Therefore, the total time complexity for this part is O(N).

### Overall Time Complexity:

Considering all the steps, the overall time complexity of the provided code is O(N log N). The dominant factor is the `statistics.median` function contributing to the overall complexity, we have an improvement only in the average case because `statistics.median`has a complexity of O(N).

Here are some input/output examples (the first value is the initial personal score, and the second line contains the list of marks):

#### Input 1

8

5 7 1 

#### Output 1

11

#### Input 2

25

18 24 21 32 27

#### Output 2

44

#### Input 3

30

13 27 41 59 28 33 39 19 52 48 55 79

#### Output 3



205