# GenAI-Camp: Day 01
## Lesson: Data Understanding with Pandas

This lesson introduces the basics of data understanding with *pandas*.

In this lesson you will learn how to:

- read tabular data
- explore descriptive statistics
- do simple data manipulation

### Set up the environment
Import the necessary libraries, detect the runtime environment, and set the path constants.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os

In [None]:
# Check whether the notebook is running in a Google Colab environment.
if os.getenv("COLAB_RELEASE_TAG"):
    COLAB = True
    print("Running on COLAB environment.")
else:
    COLAB = False
    print("WARNING: Running on LOCAL environment.")

In [None]:
# Define the path to the resources
if COLAB:
    # Clone the data repository into colab
    !git clone https://github.com/openknowledge/workshop-genai-camp-data.git
    DATA_PATH = "/content/workshop-genai-camp-data/day-01/data"
else:
    DATA_PATH = "../data"
IMDB_FILE = DATA_PATH + "/imdb_dataset.csv"

### Introduction: [Pandas](https://pandas.pydata.org/docs/)

In [None]:
# Create a sample dataframe for demonstration
data = {
    'Movie': ['Inception', 'The Dark Knight', 'Interstellar', 'Dunkirk', 'Greatest Showman'],
    'Year': [2010, 2008, 2014, 2017, 2017],
    'Rating': [8.8, 9.0, 8.6, 7.9, 7.5]
}

# Load the sample data into a pandas DataFrame
df_introduction = pd.DataFrame(data)
df_introduction

Unnamed: 0,Movie,Year,Rating
0,Inception,2010,8.8
1,The Dark Knight,2008,9.0
2,Interstellar,2014,8.6
3,Dunkirk,2017,7.9
4,Greatest Showman,2017,7.5


In [None]:
# A single column of a DataFrame is called a Series
type(df_introduction['Movie'])

pandas.core.series.Series

In [None]:
# Selecting a single column of a DataFrame returns it as a Series
df_introduction['Movie']

0           Inception
1     The Dark Knight
2        Interstellar
3             Dunkirk
4    Greatest Showman
Name: Movie, dtype: object
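
To illustrate the difference between a Series and a DataFrame, the sketch below rebuilds a reduced version of the sample data. The names `movies`, `title_series`, and `title_frame` are chosen here for illustration and are not part of the notebook. Single brackets return a Series, while double brackets return a one-column DataFrame.

```python
import pandas as pd

# Reduced copy of the introduction data (illustrative only)
movies = pd.DataFrame({
    "Movie": ["Inception", "The Dark Knight", "Interstellar"],
    "Rating": [8.8, 9.0, 8.6],
})

# Single brackets select one column as a Series
title_series = movies["Movie"]

# Double brackets (a list of column names) keep the result a DataFrame
title_frame = movies[["Movie"]]

print(type(title_series).__name__)
print(type(title_frame).__name__)

# A Series carries its column name and the row index of its parent DataFrame
print(title_series.name, title_series.index.to_list())
```

Which form you need depends on the next step: string and numeric methods such as `.str.len()` or `.mean()` are usually called on a Series, whereas a DataFrame is the natural choice when you want to keep a table-shaped result.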

In [None]:
# Filter the dataset to include only movies with a rating of 8.0 or higher
high_rated_movies = df_introduction[df_introduction['Rating'] >= 8.0]
high_rated_movies

Unnamed: 0,Movie,Year,Rating
0,Inception,2010,8.8
1,The Dark Knight,2008,9.0
2,Interstellar,2014,8.6
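
The filter above uses a single condition. As a minimal sketch of how several conditions can be combined, the example below rebuilds the sample data under the illustrative name `movies` (an assumption, not a name from the notebook) and keeps only movies that are both highly rated and released in 2010 or later.

```python
import pandas as pd

# Copy of the introduction data (illustrative only)
movies = pd.DataFrame({
    "Movie": ["Inception", "The Dark Knight", "Interstellar", "Dunkirk", "Greatest Showman"],
    "Year": [2010, 2008, 2014, 2017, 2017],
    "Rating": [8.8, 9.0, 8.6, 7.9, 7.5],
})

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds more tightly than >= does.
mask = (movies["Rating"] >= 8.0) & (movies["Year"] >= 2010)
recent_high_rated = movies[mask]

# The same filter expressed as a query string
recent_high_rated_query = movies.query("Rating >= 8.0 and Year >= 2010")

print(recent_high_rated["Movie"].to_list())  # ['Inception', 'Interstellar']
```

Both forms select the same rows. The mask form is easier to build up step by step, while `.query()` can be more readable for longer conditions.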


In [None]:
# Create a new column 'Decade' based on the 'Year' column
df_introduction['Decade'] = (df_introduction['Year'] // 10) * 10
df_introduction

Unnamed: 0,Movie,Year,Rating,Decade
0,Inception,2010,8.8,2010
1,The Dark Knight,2008,9.0,2000
2,Interstellar,2014,8.6,2010
3,Dunkirk,2017,7.9,2010
4,Greatest Showman,2017,7.5,2010
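
The assignment above modifies `df_introduction` in place. If you would rather leave the original DataFrame untouched, one possible alternative is `.assign()`, which returns a new DataFrame. The sketch below uses a reduced, illustrative copy of the data named `movies`.

```python
import pandas as pd

# Reduced copy of the introduction data (illustrative only)
movies = pd.DataFrame({
    "Movie": ["Inception", "The Dark Knight", "Dunkirk"],
    "Year": [2010, 2008, 2017],
})

# Floor division by 10 drops the last digit of the year; multiplying by 10
# restores the magnitude, e.g. 2008 // 10 = 200 and 200 * 10 = 2000.
with_decade = movies.assign(Decade=lambda frame: (frame["Year"] // 10) * 10)

print(with_decade["Decade"].to_list())  # [2010, 2000, 2010]
print("Decade" in movies.columns)       # False
```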


In [None]:
# Get the movie title with the highest rating
highest_rated_movie = df_introduction.loc[df_introduction['Rating'].idxmax(), 'Movie']
highest_rated_movie

'The Dark Knight'
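
`idxmax()` returns the index label of the first maximum, which `.loc` then uses to look up the row. When more than one top entry is of interest, `.nlargest()` is a possible alternative. The sketch below is illustrative and uses a reduced copy of the data named `movies`.

```python
import pandas as pd

# Copy of the introduction data (illustrative only)
movies = pd.DataFrame({
    "Movie": ["Inception", "The Dark Knight", "Interstellar", "Dunkirk", "Greatest Showman"],
    "Rating": [8.8, 9.0, 8.6, 7.9, 7.5],
})

# Index label of the row with the highest rating
best_label = movies["Rating"].idxmax()
best_title = movies.loc[best_label, "Movie"]

# The two highest-rated movies, sorted by rating in descending order
top_two = movies.nlargest(2, "Rating")

print(best_label, best_title)        # 1 The Dark Knight
print(top_two["Movie"].to_list())    # ['The Dark Knight', 'Inception']
```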

### IMDB Dataset of Movie Reviews with Sentiments
The IMDB Dataset is a popular dataset used for natural language processing (NLP) and machine learning tasks, particularly for binary sentiment classification (positive/negative). It contains a large collection of movie reviews from the Internet Movie Database (IMDB), labeled according to their sentiment.

In [None]:
# Load the IMDB dataset
df = pd.read_csv(IMDB_FILE)
df

Unnamed: 0,text,label,year
0,I grew up (b. 1965) watching and loving the Th...,0,2015
1,"When I put this movie in my DVD player, and sa...",0,2022
2,Why do people who do not know what a particula...,0,2015
3,Even though I have great interest in Biblical ...,0,2021
4,Im a die hard Dads Army fan and nothing will e...,1,2021
...,...,...,...
39995,"""Western Union"" is something of a forgotten cl...",1,2018
39996,This movie is an incredible piece of work. It ...,1,2021
39997,My wife and I watched this movie because we pl...,0,2023
39998,"When I first watched Flatliners, I was amazed....",1,2018
39999,"Why would this film be so good, but only gross...",1,2016


### Exercise 01: Data Inspection and Initial Understanding
Understand the structure and contents of the IMDB dataset. Inspect the data types, check for missing values, and analyze the distribution of labels (positive/negative reviews).
1. Display the first 5 rows and last 5 rows of the dataset.
2. Check the column names, data types, and the number of missing values for each column.
3. Count the number of positive and negative reviews.

**Hints**:
* Use `.head()` and `.tail()` to inspect the rows of a dataframe (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)).
* `.info()` will give you the column names and data types (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)).
* `.isna().sum()` will tell you if there are any missing values in the dataset (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html)).
* Use `.value_counts()` to count the distribution of positive and negative labels (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)).


In [None]:
# TODO: Your implementation
df.head()

Unnamed: 0,text,label,year
0,I grew up (b. 1965) watching and loving the Th...,0,2015
1,"When I put this movie in my DVD player, and sa...",0,2022
2,Why do people who do not know what a particula...,0,2015
3,Even though I have great interest in Biblical ...,0,2021
4,Im a die hard Dads Army fan and nothing will e...,1,2021


In [None]:
df.tail()

Unnamed: 0,text,label,year
39995,"""Western Union"" is something of a forgotten cl...",1,2018
39996,This movie is an incredible piece of work. It ...,1,2021
39997,My wife and I watched this movie because we pl...,0,2023
39998,"When I first watched Flatliners, I was amazed....",1,2018
39999,"Why would this film be so good, but only gross...",1,2016


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    40000 non-null  object
 1   label   40000 non-null  int64 
 2   year    40000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 937.6+ KB


In [None]:
df.isna().sum()

text     0
label    0
year     0
dtype: int64

In [None]:
df['label'].value_counts()

label
0    20019
1    19981
Name: count, dtype: int64
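
Absolute counts show that the two classes are almost balanced. To read the balance as proportions instead, `.value_counts()` accepts `normalize=True`. The sketch below uses a small made-up label column rather than the IMDB data.

```python
import pandas as pd

# Made-up labels for illustration (not taken from the IMDB dataset)
labels = pd.Series([0, 1, 1, 0, 1], name="label")

# Absolute number of occurrences per label
counts = labels.value_counts()

# Relative frequency per label; the values sum to 1
shares = labels.value_counts(normalize=True)

print(counts.loc[1], counts.loc[0])
print(shares.loc[1], shares.loc[0])
```

Applied to the dataset above, the same call would be `df['label'].value_counts(normalize=True)`.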

### Exercise 02: Show examples
Print a few example reviews for each label.
1. Print 5 random reviews with the label "0" (negative).
2. Print 5 random reviews with the label "1" (positive).

**Hints**:
* Use `.sample()` on the filtered dataframe (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)).
* You can create a Python list from the samples by using `.to_list()`.
* Iterate over this list and print the elements.

In [None]:
# Filter the dataframe by the 'label' column
filtered_label_0_reviews = df[df['label'] == 0]
filtered_label_1_reviews = df[df['label'] == 1]

# TODO: Your implementation
label_0_reviews = filtered_label_0_reviews.sample(5)["text"].to_list()
label_1_reviews = filtered_label_1_reviews.sample(5)["text"].to_list()

print("Sample reviews with label 0:")
for review in label_0_reviews:
    print(review)

print()
print("Sample reviews with label 1:")
for review in label_1_reviews:
    print(review)
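
Because `.sample()` draws at random, the cell above prints different reviews on every run. If reproducible output is preferred, for example to discuss the same reviews in a group, the `random_state` parameter fixes the draw. The sketch below uses made-up reviews, and the seed value `42` is an arbitrary choice.

```python
import pandas as pd

# Made-up reviews for illustration (not taken from the IMDB dataset)
reviews = pd.DataFrame({
    "text": ["great", "awful", "fine", "boring", "superb", "dull"],
    "label": [1, 0, 1, 0, 1, 0],
})

negatives = reviews[reviews["label"] == 0]

# The same seed yields the same sample on every call
first_draw = negatives.sample(2, random_state=42)["text"].to_list()
second_draw = negatives.sample(2, random_state=42)["text"].to_list()

print(first_draw == second_draw)  # True
```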

### Exercise 03: Minor data manipulation
Rename columns and values of the given dataframe.

1. Rename the column "label" to "sentiment"
2. Rename the column "text" to "review"
3. Rename values in column "sentiment" to "positive" and "negative"
4. Print the dataframe

**Hints**:
* Use `.rename()` for renaming columns (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html))
* Use `.map()` for updating values of a column (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html))

In [None]:
# TODO: Your implementation
df.rename(columns={"text": "review", "label": "sentiment"}, inplace=True)
df["sentiment"] = df["sentiment"].map({0: "negative", 1: "positive"})
df

Unnamed: 0,review,sentiment,year
0,I grew up (b. 1965) watching and loving the Th...,negative,2015
1,"When I put this movie in my DVD player, and sa...",negative,2022
2,Why do people who do not know what a particula...,negative,2015
3,Even though I have great interest in Biblical ...,negative,2021
4,Im a die hard Dads Army fan and nothing will e...,positive,2021
...,...,...,...
39995,"""Western Union"" is something of a forgotten cl...",positive,2018
39996,This movie is an incredible piece of work. It ...,positive,2021
39997,My wife and I watched this movie because we pl...,negative,2023
39998,"When I first watched Flatliners, I was amazed....",positive,2018
39999,"Why would this film be so good, but only gross...",positive,2016
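
Two details of this solution are worth a closer look. Without `inplace=True`, `.rename()` returns a new DataFrame and leaves the original unchanged. In addition, `.map()` with a dictionary turns every value that has no matching key into `NaN`, so checking for missing values afterwards is a cheap sanity check. The sketch below uses made-up data, including a deliberately unexpected label `2`.

```python
import pandas as pd

# Made-up data for illustration; the label 2 is intentionally not in the mapping
raw = pd.DataFrame({"text": ["good", "bad", "unclear"], "label": [1, 0, 2]})

# rename() without inplace=True returns a renamed copy
renamed = raw.rename(columns={"text": "review", "label": "sentiment"})

# Values without a key in the dictionary become NaN
renamed["sentiment"] = renamed["sentiment"].map({0: "negative", 1: "positive"})

unmapped = renamed["sentiment"].isna().sum()

print(list(raw.columns))      # ['text', 'label']
print(list(renamed.columns))  # ['review', 'sentiment']
print(unmapped)               # 1
```

Note that running the notebook cell above a second time would hit the same effect: the column then already contains strings, none of which match the keys `0` and `1`.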


### Exercise 04: Summary Statistics on Review Lengths
Analyze the length of the movie reviews in terms of character count. Calculate key summary statistics for the review lengths, including the average, maximum, and minimum lengths.
1. Add a new column called review_length that stores the length of each review (in characters).
2. Calculate the average, maximum, and minimum review lengths.
3. Find the median and standard deviation of the review lengths.
4. Find the longest and shortest reviews in the dataset.

**Hints**:
* Use `.apply(len)` to create the review_length column  (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html)).
* Use `.mean()`, `.max()`, `.min()`, `.median()`, and `.std()` to get the desired statistics (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html)).
* To find the longest and shortest reviews, you can use `.idxmax()` and `.idxmin()` on the review_length column (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html)).


In [None]:
# TODO: Your implementation
# Calculate length of reviews
df["review_length"] = df["review"].apply(len)

# Average, Minimum and Maximum
avg_review_length = df["review_length"].mean()
min_review_length = df["review_length"].min()
max_review_length = df["review_length"].max()
print(f"{avg_review_length=}, {min_review_length=}, {max_review_length=}")

# Median & Standard Deviation
median_review_length = df["review_length"].median()
std_review_length = df["review_length"].std()
print(f"{median_review_length=}, {std_review_length=}")

# Find longest review
longest_review = df.loc[df["review_length"].idxmax()]
print(f"Longest review: {longest_review['review']}")

# Find shortest review
shortest_review = df.loc[df["review_length"].idxmin()]
print(f"Shortest review: {shortest_review['review']}")

avg_review_length=np.float64(1310.29325), min_review_length=np.int64(32), max_review_length=np.int64(13704)
median_review_length=np.float64(973.0), std_review_length=np.float64(988.3585988002701)
Longest review: Match 1: Tag Team Table Match Bubba Ray and Spike Dudley vs Eddie Guerrero and Chris Benoit Bubba Ray and Spike Dudley started things off with a Tag Team Table Match against Eddie Guerrero and Chris Benoit. According to the rules of the match, both opponents have to go through tables in order to get the win. Benoit and Guerrero heated up early on by taking turns hammering first Spike and then Bubba Ray. A German suplex by Benoit to Bubba took the wind out of the Dudley brother. Spike tried to help his brother, but the referee restrained him while Benoit and Guerrero ganged up on him in the corner. With Benoit stomping away on Bubba, Guerrero set up a table outside. Spike dashed into the ring and somersaulted over the top rope onto Guerrero on the outside! After recovering and t
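
As a possible alternative to `.apply(len)`, the vectorized string accessor `.str.len()` computes the same character counts, and `.describe()` returns most of the requested statistics in a single call. The sketch below uses three made-up reviews.

```python
import pandas as pd

# Made-up reviews for illustration (not taken from the IMDB dataset)
reviews = pd.DataFrame({"review": ["Loved it", "Bad", "Not my kind of film"]})

# Character count per review via the string accessor
reviews["review_length"] = reviews["review"].str.len()

# count, mean, std, min, quartiles (50% is the median), and max in one call
summary = reviews["review_length"].describe()

print(reviews["review_length"].to_list())  # [8, 3, 19]
print(summary["mean"], summary["50%"], summary["max"])
```

One practical difference: `.str.len()` returns `NaN` for missing values, whereas `.apply(len)` raises a `TypeError` when it encounters one. The IMDB dataset has no missing reviews, so both approaches work here.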

### Exercise 05: Sentiment Distribution by length of review
Understand how review lengths vary by sentiment (positive/negative). Investigate whether longer reviews tend to be more positive or negative.
1. Group the reviews by sentiment (positive, negative).
2. Calculate the average, minimum and maximum review length for each sentiment.
3. Compare the distribution of review lengths between positive and negative reviews.

**Hints**:
* Use `.groupby()` to group the data by sentiment (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)).
* Calculate the mean length using `.mean()` and the maximum/minimum lengths with `.agg(['max', 'min'])` (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)).
* To compare distributions, you can use `.describe()` to summarize the statistics for each group (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)).

In [None]:
# TODO: Your implementation here

# Solution (1/2)

# Group the reviews by sentiment (positive/negative)
grouped = df.groupby("sentiment")

# Calculate the average, minimum and maximum review length for each sentiment
stats_count_by_sentiment = grouped["review_length"].agg(["mean", "min", "max"]).rename(
    columns={
        "mean": "avg_review_length",
        "min": "min_review_length",
        "max": "max_review_length"
    }
)

stats_count_by_sentiment

sentiment,avg_review_length,min_review_length,max_review_length
negative,1292.53699,32,8969
positive,1328.083279,65,13704
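
The solution above aggregates first and renames the columns afterwards. pandas also supports named aggregation, where the output column names are given directly as keyword arguments to `.agg()`. The sketch below shows this variant on made-up data.

```python
import pandas as pd

# Made-up review lengths for illustration (not taken from the IMDB dataset)
reviews = pd.DataFrame({
    "sentiment": ["positive", "negative", "positive", "negative"],
    "review_length": [120, 80, 200, 40],
})

# Named aggregation: keyword = output column name, value = aggregation function
stats = reviews.groupby("sentiment")["review_length"].agg(
    avg_review_length="mean",
    min_review_length="min",
    max_review_length="max",
)

print(stats)
```

The result has one row per sentiment and the three named columns, without a separate `.rename()` step.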


In [None]:
# Solution (2/2)
# Compare the distribution of review lengths between positive and negative reviews
print("Review length distribution by sentiment:")
print(grouped["review_length"].describe())

Review length distribution by sentiment:
             count         mean          std   min    25%    50%     75%  \
sentiment                                                                  
negative   20019.0  1292.536990   942.220087  32.0  705.0  973.0  1571.0   
positive   19981.0  1328.083279  1032.236721  65.0  690.0  972.0  1621.0   

               max  
sentiment           
negative    8969.0  
positive   13704.0  
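
In the output above, the medians of the two groups are almost identical (973 and 972), while the mean and maximum are higher for positive reviews. One plausible reading is that a small number of very long positive reviews pulls the mean up. The sketch below reproduces this effect on made-up numbers to show why comparing mean and median side by side is useful.

```python
import pandas as pd

# Made-up lengths for illustration: both groups share the same typical values,
# but the positive group contains one very long review
reviews = pd.DataFrame({
    "sentiment": ["negative"] * 5 + ["positive"] * 5,
    "review_length": [10, 20, 30, 40, 50, 10, 20, 30, 40, 500],
})

# The median is robust against single extreme values, the mean is not
summary = reviews.groupby("sentiment")["review_length"].agg(["mean", "median"])

print(summary)
```

Here both groups have a median of 30, yet the positive group has a mean of 120 compared to 30 for the negative group, caused entirely by the single outlier.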
