# 01 - Data Exploration

**Goal:** Load and explore our datasets to understand what we're working with.

**What we'll learn:**
- How to load data with pandas
- How to view and inspect data
- How to understand the structure of our datasets

## Step 1: Import Libraries

First, we need to import the tools we'll use.

**What does `import` do?** It loads a module (code that we or other people have already written) so we can use it in this notebook.

In [23]:
# Import pandas - our data manipulation library
# We call it 'pd' for short (this is the standard convention)
import pandas as pd

# This line makes sure we can see all columns when we display data
pd.set_option('display.max_columns', None)

print("Libraries imported successfully!")

Libraries imported successfully!
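The display option set above can also be read back, or applied only temporarily. The following is a minimal sketch using `pd.get_option`, `pd.set_option`, and `pd.option_context`; the variable names `previous`, `inside`, and `after` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Remember the current setting so it can be restored at the end
previous = pd.get_option('display.max_columns')

# None means "no limit": every column is shown
pd.set_option('display.max_columns', None)

# option_context applies a setting only inside the `with` block
with pd.option_context('display.max_columns', 5):
    inside = pd.get_option('display.max_columns')
    print(f"Inside the block: {inside}")

# Outside the block the earlier value (None) is back in effect
after = pd.get_option('display.max_columns')
print(f"After the block: {after}")

# Put the original setting back
pd.set_option('display.max_columns', previous)
```

Using `option_context` is handy when only one cell needs a different display setting.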


## Step 2: Load the IMDB Dataset

We'll start with the IMDB dataset because it's the smallest and simplest.

**What is a DataFrame?** Think of it like an Excel spreadsheet - rows and columns of data.

In [24]:
# Load the IMDB dataset
# pd.read_csv() reads a CSV file and turns it into a DataFrame

imdb_path = '../data/raw/IMDB Dataset.csv'
imdb_df = pd.read_csv(imdb_path)

print("IMDB dataset loaded successfully!")
print(f"Number of rows: {len(imdb_df)}")

IMDB dataset loaded successfully!
Number of rows: 50000
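To try `pd.read_csv()` without the real file, the sketch below reads a tiny in-memory CSV instead. The three sample rows are made up for illustration, and `sample_df` is a hypothetical name. It also shows the `nrows` parameter, which reads only the first few data rows and can be useful for a quick peek at a large file.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV file (illustration only)
csv_text = io.StringIO(
    "review,sentiment\n"
    "Great film,positive\n"
    "Too slow,negative\n"
    "Loved it,positive\n"
)

# nrows limits how many data rows are read (the header row is not counted)
sample_df = pd.read_csv(csv_text, nrows=2)

print(sample_df)
print(f"Number of rows: {len(sample_df)}")
```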


## Step 3: Look at the Data

Let's see what the data actually looks like.

In [25]:
# .head() shows the first 5 rows
# This is usually the first thing to do when exploring data

imdb_df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
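Besides `.head()`, pandas offers a few closely related ways to peek at rows. The sketch below uses a small made-up DataFrame (`toy_df` is a hypothetical name, not the IMDB data) to show `.head(n)`, `.tail(n)`, and `.sample(n)`.

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    'review': ['Great film', 'Too slow', 'Loved it',
               'Not for me', 'A classic', 'Forgettable'],
    'sentiment': ['positive', 'negative', 'positive',
                  'negative', 'positive', 'negative'],
})

# .head(n) returns the first n rows, .tail(n) the last n rows
first_two = toy_df.head(2)
last_two = toy_df.tail(2)

# .sample(n) returns n random rows; random_state makes the pick repeatable
random_three = toy_df.sample(3, random_state=0)

print(first_two)
print(last_two)
print(random_three)
```

Random sampling can be worth a look because the first rows of a file are not always representative of the rest.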


In [26]:
# .shape tells us (number of rows, number of columns)

print(f"Dataset shape: {imdb_df.shape}")
print(f"This means: {imdb_df.shape[0]} reviews and {imdb_df.shape[1]} columns")

Dataset shape: (50000, 2)
This means: 50000 reviews and 2 columns


In [27]:
# .columns shows us the column names

print("Columns in the dataset:")
print(imdb_df.columns.tolist())

Columns in the dataset:
['review', 'sentiment']


In [28]:
# .info() gives us a summary of the dataset
# - Column names
# - Data types
# - Non-null counts (helps spot missing data)

imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
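The non-null counts from `.info()` can also be checked directly. The sketch below, on a small made-up DataFrame (`toy_df` is a hypothetical name), counts missing values per column with `.isna().sum()` and fully duplicated rows with `.duplicated().sum()`.

```python
import pandas as pd

# Made-up data: one missing review and one exact duplicate row
toy_df = pd.DataFrame({
    'review': ['Great film', None, 'Loved it', 'Great film'],
    'sentiment': ['positive', 'negative', 'positive', 'positive'],
})

# .isna() marks missing cells as True; .sum() counts them per column
missing_per_column = toy_df.isna().sum()

# .duplicated() marks rows that repeat an earlier row exactly
duplicate_rows = toy_df.duplicated().sum()

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```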


## Step 4: Understand the Sentiment Labels

Let's see what values are in the 'sentiment' column.

In [29]:
# .value_counts() counts how many times each unique value appears
# This tells us if our dataset is balanced (equal positive and negative)

print("Sentiment distribution:")
print(imdb_df['sentiment'].value_counts())

Sentiment distribution:
sentiment
positive    25000
negative    25000
Name: count, dtype: int64


In [30]:
# Let's see the percentage breakdown

print("\nSentiment distribution (percentage):")
print(imdb_df['sentiment'].value_counts(normalize=True) * 100)


Sentiment distribution (percentage):
sentiment
positive    50.0
negative    50.0
Name: proportion, dtype: float64
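One way to summarise balance in a single number is the ratio of the smallest class to the largest. This is a minimal sketch on made-up labels; `labels` and `balance_ratio` are illustrative names, and the ratio is just one possible measure, not something the original notebook defines.

```python
import pandas as pd

# Made-up labels: three positive, one negative
labels = pd.Series(['positive', 'negative', 'positive', 'positive'],
                   name='sentiment')

counts = labels.value_counts()

# 1.0 means perfectly balanced; values near 0 mean heavily imbalanced
balance_ratio = counts.min() / counts.max()

print(counts)
print(f"Balance ratio: {balance_ratio:.2f}")
```

Applied to the IMDB counts shown above (25000 and 25000), this ratio would be 1.0.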


## Step 5: Look at Some Actual Reviews

Let's read a few reviews to understand what we're working with.

In [31]:
# Look at a positive review
# .iloc[0] gets the first row, ['review'] gets the review column

positive_reviews = imdb_df[imdb_df['sentiment'] == 'positive']
print("=== POSITIVE REVIEW EXAMPLE ===")
print(positive_reviews.iloc[0]['review'][:500])  # First 500 characters
print("...")

=== POSITIVE REVIEW EXAMPLE ===
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ
...


In [32]:
# Look at a negative review

negative_reviews = imdb_df[imdb_df['sentiment'] == 'negative']
print("=== NEGATIVE REVIEW EXAMPLE ===")
print(negative_reviews.iloc[0]['review'][:500])  # First 500 characters
print("...")

=== NEGATIVE REVIEW EXAMPLE ===
Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins 
...
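Both example reviews contain `<br />` tags. To estimate how widespread such tags are, something like the following could be used. It is a sketch on made-up reviews (`toy_reviews` is a hypothetical name), using `.str.contains()` with `regex=False` so that `'<br'` is matched as literal text.

```python
import pandas as pd

# Made-up reviews for illustration only
toy_reviews = pd.Series([
    'Loved it.<br /><br />Would watch again.',
    'Too slow for my taste.',
    'Great cast.<br />Weak script.',
])

# True for every review that contains the literal text '<br'
has_br = toy_reviews.str.contains('<br', regex=False)

# Summing booleans counts the True values; the mean gives the share
n_with_br = has_br.sum()
share_with_br = has_br.mean() * 100

print(f"Reviews with <br> tags: {n_with_br} ({share_with_br:.1f}%)")
```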


## Step 6: Check Review Lengths

How long are these reviews? This matters later because many text models cap their input at a fixed length, so very long reviews may get truncated.

In [33]:
# Create a new column with the length of each review
# .apply() runs a function on each value in the column
# len() counts the number of characters

imdb_df['review_length'] = imdb_df['review'].apply(len)

# .describe() gives us statistics
print("Review length statistics:")
print(imdb_df['review_length'].describe())

Review length statistics:
count    50000.000000
mean      1309.431020
std        989.728014
min         32.000000
25%        699.000000
50%        970.000000
75%       1590.250000
max      13704.000000
Name: review_length, dtype: float64
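Character counts are one view of length; word counts and a per-label comparison are two others. The sketch below uses made-up data (`toy_df` and the `word_count` column are illustrative, not from the original notebook). `.str.len()` gives the same character counts as `.apply(len)` when there are no missing values.

```python
import pandas as pd

# Made-up data for illustration only
toy_df = pd.DataFrame({
    'review': ['Great film', 'Far too slow and dull', 'Loved it'],
    'sentiment': ['positive', 'negative', 'positive'],
})

# Character count per review
toy_df['review_length'] = toy_df['review'].str.len()

# Word count per review: split on whitespace, then count the pieces
toy_df['word_count'] = toy_df['review'].str.split().str.len()

# Average character length for each sentiment label
mean_length_by_label = toy_df.groupby('sentiment')['review_length'].mean()

print(toy_df)
print(mean_length_by_label)
```

Comparing lengths by label is a quick check for whether positive and negative reviews differ systematically in length.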


## Summary: What Did We Learn?

Fill this in based on what you observed:

1. **Dataset size:** _____ reviews
2. **Columns:** _____ columns (review text and sentiment label)
3. **Balance:** Is it balanced? (equal positive and negative?)
4. **Review lengths:** Average _____ characters, shortest _____, longest _____
5. **Any issues spotted?** (HTML tags, special characters, etc.)

---

## Next Steps

Now that we understand the data, we need to:
1. Clean the text (remove HTML, special characters)
2. Explore the Amazon and Yelp datasets
3. Prepare the data for modeling