# GenAI-Camp: Day 01
## Lesson: Data Understanding with Pandas

This lesson covers the basics of data understanding with *pandas*.

In this lesson you will learn how to:

- read tabular data
- explore descriptive statistics
- do simple data manipulation

### Set up the environment
Import the necessary libraries, set constants, and define helper functions.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os

In [None]:
# Detect the runtime environment: Colab sets the COLAB_RELEASE_TAG environment variable.
if os.getenv("COLAB_RELEASE_TAG"):
    COLAB = True
    print("Running on COLAB environment.")
else:
    COLAB = False
    print("WARNING: Running on LOCAL environment.")

In [None]:
# Define the path to the resources
if COLAB:
    # Clone the data repository into colab
    !git clone https://github.com/openknowledge/workshop-genai-camp-data.git
    DATA_PATH = "/content/workshop-genai-camp-data/day-01/data"
else:
    DATA_PATH = "../data"
IMDB_FILE = DATA_PATH + "/imdb_dataset.csv"

### Introduction: [Pandas](https://pandas.pydata.org/docs/)

In [None]:
# Create a sample dataframe for demonstration
data = {
    'Movie': ['Inception', 'The Dark Knight', 'Interstellar', 'Dunkirk', 'Greatest Showman'],
    'Year': [2010, 2008, 2014, 2017, 2017],
    'Rating': [8.8, 9.0, 8.6, 7.9, 7.5]
}

# Load the sample data into a pandas DataFrame
df_introduction = pd.DataFrame(data)
df_introduction

In [None]:
# A single column of a DataFrame is a Series
type(df_introduction['Movie'])

In [None]:
# Accessing a column in a dataframe returns the Series
df_introduction['Movie']

In [None]:
# Filter the dataset to include only movies with a rating of 8.0 or higher
high_rated_movies = df_introduction[df_introduction['Rating'] >= 8.0]
high_rated_movies

In [None]:
# Create a new column 'Decade' based on the 'Year' column
df_introduction['Decade'] = (df_introduction['Year'] // 10) * 10
df_introduction

In [None]:
# Get the movie title with the highest rating
highest_rated_movie = df_introduction.loc[df_introduction['Rating'].idxmax(), 'Movie']
highest_rated_movie

### IMDB Dataset of Movie Reviews with Sentiments
The IMDB Dataset is a popular dataset used for natural language processing (NLP) and machine learning tasks, particularly for binary sentiment classification (positive/negative). It contains a large collection of movie reviews from the Internet Movie Database (IMDB), labeled according to their sentiment.

In [None]:
# Load the IMDB dataset
df = pd.read_csv(IMDB_FILE)
df

### Exercise 01: Data Inspection and Initial Understanding
Understand the structure and contents of the IMDB dataset. Inspect the data types, check for missing values, and analyze the distribution of labels (positive/negative reviews).
1. Display the first 5 rows and last 5 rows of the dataset.
2. Check the column names, data types, and the number of missing values for each column.
3. Count the number of positive and negative reviews.

**Hints**:
* Use `.head()` and `.tail()` to inspect the rows of a dataframe (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)).
* `.info()` will give you the column names and data types (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)).
* `.isna().sum()` will give you the number of missing values in each column (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html)).
* Use `.value_counts()` to count the distribution of positive and negative labels (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)).
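
The inspection methods from the hints can be tried on a small table first. The following is a minimal sketch on hypothetical toy data; the DataFrame `toy` and its contents are assumptions made for illustration and are not the IMDB data.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the IMDB dataset)
toy = pd.DataFrame({
    "text": ["Great film", "Too long", None, "Loved it", "Boring", "Fine"],
    "label": [1, 0, 1, 1, 0, 1],
})

first_rows = toy.head(2)   # first 2 rows (the default is 5)
last_rows = toy.tail(2)    # last 2 rows (the default is 5)

toy.info()                 # prints column names, dtypes, and non-null counts

# isna() marks missing cells with True; sum() counts them per column
missing_per_column = toy.isna().sum()

# value_counts() counts how often each distinct value occurs in a Series
label_counts = toy["label"].value_counts()

print(missing_per_column)
print(label_counts)
```

Note that `.info()` prints its report directly instead of returning a DataFrame, so it is usually called on its own line.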


In [None]:
# TODO: Your implementation

### Exercise 02: Show examples
Print out some examples for positive and negative reviews.
1. Print out 5 random reviews which have the label `0`
2. Print out 5 random reviews which have the label `1`

**Hints**:
* Use `.sample()` on the filtered dataframe (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)).
* Select the `text` column of the sample and convert it to a Python list with `.to_list()`.
* Iterate over this list and print the elements.
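
The sample-then-print pattern from the hints might look like the sketch below. It runs on hypothetical toy data, and the `random_state` argument is an optional addition that only makes the random draw repeatable.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the IMDB dataset)
toy = pd.DataFrame({
    "text": ["Great film", "Too long", "Loved it", "Boring", "Fine", "Awful"],
    "label": [1, 0, 1, 0, 1, 0],
})

# Keep only the rows whose label equals 0
label_0_rows = toy[toy["label"] == 0]

# Draw 2 random rows without replacement, then turn the text column into a list
sampled_texts = label_0_rows.sample(n=2, random_state=42)["text"].to_list()

for text in sampled_texts:
    print(text)
    print("-" * 40)  # visual separator between entries
```

Leaving out `random_state` gives a different selection on every run, which is often what you want when browsing examples.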

In [None]:
# Filter the dataframe by the 'label' column
filtered_label_0_reviews = df[df['label'] == 0]
filtered_label_1_reviews = df[df['label'] == 1]

# TODO: Your implementation

### Exercise 03: Minor data manipulation
Rename columns and values of the given dataframe.

1. Rename the column "label" to "sentiment"
2. Rename the column "text" to "review"
3. Rename values in column "sentiment" to "positive" and "negative"
4. Print the dataframe

**Hints**:
* Use `.rename()` for renaming columns (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html))
* Use `.map()` for updating values of a column (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html))
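
As a rough illustration of both hints, here is a sketch on a hypothetical toy table. The column names (`txt`, `lbl`, `content`, `rating`) and the replacement values are made up for the example and deliberately differ from the exercise.

```python
import pandas as pd

# Hypothetical toy data with made-up column names
toy = pd.DataFrame({"txt": ["Great film", "Too long"], "lbl": [1, 0]})

# rename() returns a new DataFrame; the original keeps its column names
renamed = toy.rename(columns={"txt": "content", "lbl": "rating"})

# map() looks up every value in the dict; values without an entry become NaN
renamed["rating"] = renamed["rating"].map({1: "good", 0: "bad"})

print(renamed)
```

Because `.rename()` does not modify the DataFrame in place by default, the result has to be assigned to a variable.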

In [None]:
# TODO: Your implementation

### Exercise 04: Summary Statistics on Review Lengths
Analyze the length of the movie reviews in terms of character count. Calculate key summary statistics for the review lengths, including the average, maximum, and minimum lengths.
1. Add a new column called review_length that stores the length of each review (in characters).
2. Calculate the average, maximum, and minimum review lengths.
3. Find the median and standard deviation of the review lengths.
4. Find the longest and shortest reviews in the dataset.

**Hints**:
* Use `.apply(len)` on the review column to create the `review_length` column (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html)).
* Use `.mean()`, `.max()`, `.min()`, `.median()`, and `.std()` to get the desired statistics (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html)).
* To find the longest and shortest reviews, you can use `.idxmax()` and `.idxmin()` on the review_length column (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html)).
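
The sketch below shows how these calls fit together on a hypothetical three-row table; the names `toy` and `n_chars` are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"text": ["Good", "Not my kind of film", "Okay-ish"]})

# Apply Python's built-in len() to every string in the column
toy["n_chars"] = toy["text"].apply(len)

# Summary statistics of the new numeric column
stats = {
    "mean": toy["n_chars"].mean(),
    "max": toy["n_chars"].max(),
    "min": toy["n_chars"].min(),
    "median": toy["n_chars"].median(),
    "std": toy["n_chars"].std(),
}
print(stats)

# idxmax()/idxmin() return the index label of the largest/smallest value,
# which .loc can then use to look up the matching text
longest_text = toy.loc[toy["n_chars"].idxmax(), "text"]
shortest_text = toy.loc[toy["n_chars"].idxmin(), "text"]
print(longest_text)
print(shortest_text)
```

If a column contains missing values, `.apply(len)` raises an error because `len()` is not defined for them; `.str.len()` is an alternative that returns NaN for such entries.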


In [None]:
# TODO: Your implementation

### Exercise 05: Sentiment Distribution by length of review
Understand how review lengths vary by sentiment (positive/negative). Investigate whether longer reviews tend to be more positive or negative.
1. Group the reviews by sentiment (positive, negative).
2. Calculate the average, minimum and maximum review length for each sentiment.
3. Compare the distribution of review lengths between positive and negative reviews.

**Hints**:
* Use `.groupby()` to group the data by sentiment (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)).
* Calculate the mean length using `.mean()` and the maximum/minimum lengths with `.agg(['max', 'min'])` (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)).
* To compare distributions, you can use `.describe()` to summarize the statistics for each group (see [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)).
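
A grouped summary along these lines could be sketched as follows. The table, the `group` column, and the `n_chars` column are hypothetical stand-ins for the exercise data.

```python
import pandas as pd

# Hypothetical toy data: two groups with a numeric length column
toy = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a"],
    "n_chars": [10, 40, 20, 60, 30],
})

# Group the rows by the 'group' column and select the numeric column
by_group = toy.groupby("group")["n_chars"]

# agg() with a list computes several statistics per group in one call
summary = by_group.agg(["mean", "min", "max"])

# describe() adds count, std, and quartiles for each group
details = by_group.describe()

print(summary)
print(details)
```

Passing all statistics to a single `.agg()` call yields one table with one row per group, which makes the groups easy to compare side by side.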

In [None]:
# TODO: Your implementation
