# Data Processing in Databricks, leveraging Pandas, PySpark, and SQL

## Instructor: [Marcelino Mayorga Quesada](https://marcelinomayorga.com/)






# 1. Summary

## 1.1 Data Processing

- Data processing is a series of operations that convert raw data into meaningful information.
- It is essential in Data Engineering and underpins Prescriptive, Descriptive, and Exploratory Analysis.
- Processed data can then be persisted to storage, analyzed, or used for machine learning.

## 1.2 Operations

All of them are applied based on need and objectives:

- Cleaning: 
  - Removing duplicates
  - Imputing or deleting missing values
  - Correcting errors and inconsistencies
- Integration: 
  - ETL (Extract Transform Load)
  - Merging and joining data (e.g., for data warehousing)
  - Augmentation
- Transformation:
  - Normalization and Standardization
  - Aggregation (Summing, Averaging)
  - Pivoting tables
  - Encoding categorical values
- Reduction: 
  - Dimensionality Reduction: PCA, t-SNE
  - Feature Selection & Extraction
  - Sampling
  - Compression
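
A few of the operations listed above can be sketched in plain pandas. The frame below is hypothetical (the `store` and `sales` columns are invented for illustration); it shows duplicate removal, mean imputation, and a sum aggregation, and is only a minimal sketch of those three steps.

```python
import pandas as pd

# Hypothetical raw data: one exact duplicate row and one missing value
raw = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "sales": [10.0, 10.0, None, 30.0, 50.0],
})

# Cleaning: remove exact duplicate rows
deduped = raw.drop_duplicates()

# Cleaning: impute the missing value with the mean of the remaining values
imputed = deduped.assign(sales=deduped["sales"].fillna(deduped["sales"].mean()))

# Transformation: aggregate by summing sales per store
totals = imputed.groupby("store")["sales"].sum()
print(totals)
```

Whether to impute or delete a missing value depends on the objective; mean imputation is used here only because it is the shortest to show.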



## 1.3 Databricks

- Unified:  
  - Data Intelligence Platform 
  - Collaborative Workspace
  - Data Lake Integration with AWS, Azure, GCP.
- Open Source Projects:
  - Optimized Apache Spark
  - MLflow
  - Delta Lake
- Scalable:
  - Automatic storage optimization for better performance

## 1.4 Tool Comparison

![Tools](https://github.com/mmayorga97/dataprocessing_databricks/blob/main/imgs/tools.png?raw=true)





# 2. Lab

In this notebook, we will explore how to use Pandas, PySpark, and SQL for data processing within Databricks.





## 2.1 Details 


### 2.1.1 Data Workflow

![Diagram](https://github.com/mmayorga97/dataprocessing_databricks/blob/main/imgs/diagram.png?raw=true)


### 2.1.2 Data Source
We'll use the Large Movie Review Dataset hosted on Hugging Face for this lab. Below are the details:

| Attribute | Value |
|-----------|-------|
| Source | Hugging Face |
| Dataset | [imdb](https://huggingface.co/datasets/stanfordnlp/imdb) |
| Columns (2) | text, label |
| Purpose | Binary Sentiment Classification |
| Rows | 25,000 (train split) |

### 2.1.3 Install required libraries

Let's install the necessary libraries.

In [0]:
!pip install datasets nltk



### 2.1.4 Import necessary libraries

In [0]:
# Import pandas library for data manipulation and analysis
import pandas as pd

# Import pandas API on Spark
import pyspark.pandas as ps

# Import functions from PySpark SQL
import pyspark.sql.functions as F

# Import SparkSession from PySpark
from pyspark.sql import SparkSession

# Import SQLContext from PySpark
from pyspark.sql import SQLContext

# Import the load_dataset function from the datasets library
from datasets import load_dataset

# Import the nltk library
import nltk

# Import the stopwords corpus from nltk
from nltk.corpus import stopwords

# Import WordNetLemmatizer from nltk.stem for lemmatization
from nltk.stem import WordNetLemmatizer

# Import word_tokenize from nltk.tokenize for tokenizing text
from nltk.tokenize import word_tokenize

# Import the PorterStemmer from nltk
from nltk.stem import PorterStemmer

# Import the RegexpTokenizer from nltk
from nltk.tokenize import RegexpTokenizer

# Import the regular expressions module
import re


## 2.2 Data Ingest


### 2.2.1 Load Dataset in Memory

We'll use Hugging Face's `datasets` library to retrieve the IMDB dataset. This data is not persisted and will disappear when the cluster is terminated or restarted.

Notice that the dataset's type is `DatasetDict`, which supports only a limited set of operations.


In [0]:
# Load the 'imdb' dataset using the load_dataset function
dataset = load_dataset('imdb')

# Check the type of the loaded dataset
type(dataset)

Out[87]: datasets.dataset_dict.DatasetDict

### 2.2.2 Load dataset into a Pandas Dataframe from Memory

We'll load the dataset into a Pandas DataFrame to unlock its full set of data manipulation features. Pandas is designed to work on a single node.
The data used for this example is considered low-volume data.

Notice that the type of `pd_df` is a Pandas DataFrame.

In [0]:
# Convert the 'train' portion of the dataset to a pandas DataFrame
pd_df = dataset['train'].to_pandas()

# Check the type of the converted pandas DataFrame
type(pd_df)

Out[88]: pandas.core.frame.DataFrame

### 2.2.3 Load Pandas Dataframe to a Pandas-on-Spark Dataframe

Now we'll load the Pandas DataFrame into a pandas-on-Spark DataFrame, which lets us keep the familiar Pandas interface while leveraging the distributed nature of Spark.


In [0]:
# Convert the pandas DataFrame pd_df to a pandas-on-Spark DataFrame ps_df
ps_df = ps.from_pandas(pd_df)

# Check the type of the converted pandas-on-Spark DataFrame
type(ps_df)

Out[89]: pyspark.pandas.frame.DataFrame

### 2.2.4 Differences Between Pandas and Spark


| Pandas | PySpark |
|--------|---------|
| DataFrames | DataFrames |
| Low Volume Data | High Volume Data |
| Single-Node Computing | Distributed Computing |
| Eager Execution | Lazy Evaluation |
| N/A | Fault Tolerance |
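
The eager/lazy row of the table can be illustrated with a plain-Python analogy. This is not Spark code: a list comprehension stands in for eager execution and a generator expression for lazy evaluation, where work is deferred until a result is requested (in Spark, until an action such as `count()` or `collect()` runs). The helper `double` and the `calls` log are hypothetical names used only for this sketch.

```python
calls = []

def double(x):
    # Record each call so we can see when the work actually happens
    calls.append(x)
    return x * 2

data = [1, 2, 3]

# Eager: the list comprehension runs double() for every element right away
eager_result = [double(x) for x in data]
calls_after_eager = len(calls)

# Lazy: the generator expression only describes the work; nothing runs yet
lazy_result = (double(x) for x in data)
calls_after_lazy_definition = len(calls)

# The deferred work happens only when the results are requested
materialized = list(lazy_result)
calls_after_materialize = len(calls)

print(calls_after_eager, calls_after_lazy_definition, calls_after_materialize)
```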




## 2.3 Quick Exploratory Analysis with Pandas API

### 2.3.1 Data's Shape

The data contains 25K rows and 2 columns

In [0]:
# Get the number of rows and columns of the pandas-on-Spark DataFrame ps_df
ps_df.shape

Out[90]: (25000, 2)

### 2.3.2 Column's Data Types

|Column|Type|
|------|----|
|text|object|
|label|int64|

In [0]:
# Get the data type of each column in the pandas-on-Spark DataFrame ps_df
ps_df.dtypes

Out[91]: text     object
label     int64
dtype: object

### 2.3.3 Summary Statistics

In [0]:
# Generate descriptive statistics for numerical columns in ps_df
ps_df.describe()

| | label |
|------|-------|
| count | 25000.0 |
| mean | 0.5 |
| std | 0.50001 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.0 |
| max | 1.0 |
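
The `std` of 0.50001 is what the sample standard deviation (dividing by n - 1) gives for a perfectly balanced 0/1 column of 25,000 rows. A minimal check using only the standard library, assuming 12,500 zeros and 12,500 ones as reported for this split:

```python
import math
import statistics

# Balanced binary labels: 12,500 zeros and 12,500 ones
labels = [0] * 12500 + [1] * 12500

mean = statistics.mean(labels)
sample_std = statistics.stdev(labels)  # sample standard deviation, divides by n - 1

# Closed form for a balanced 0/1 column: population variance is 0.25,
# rescaled by n / (n - 1) to get the sample variance
n = len(labels)
closed_form = math.sqrt(n / (n - 1) * 0.25)

print(mean, round(sample_std, 5))
```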


### 2.3.4 Missing Values

There are no missing values in either column.

In [0]:
# Count the number of null values in each column of the DataFrame
ps_df.isnull().sum()

Out[93]: text     0
label    0
dtype: int64

### 2.3.5 Positive / Negative Review Ratio

The dataset is balanced between the two labels, Positive and Negative, with 12,500 reviews each.



In [0]:
# Count occurrences of each unique value in the 'label' column
ps_df['label'].value_counts()

Out[94]: 0    12500
1    12500
Name: label, dtype: int64

### 2.3.6 Samples

In [0]:
# Sample 0.02% of the DataFrame randomly
ps_df.sample(frac=0.0002)

Unnamed: 0,text,label
8341,"""Women? They're all scrubbers...!"" <br /><br /...",0


### 2.3.7 A full sample

In [0]:
# Example of text from a review
ps_df['text'][17370]

Out[96]: 'People don\'t seem to be giving Lensman enough credit where its due. A few issues have been overlooked which are key to understanding the Lensman experience.<br /><br />The Year: For the year it was made in (1984) Lensman features some of the most stunning effects I\'ve ever seen. As a person who watches a lot of early 80\'s animation Lensman is unique in it\'s use of what appears to be computer-generated imagery at a time when computers were extremely primitive. Kim\'s battle against the geometric cutter pods in the laser maze can be taken as an excellent example of this. Every time I watch that I have to keep repeating to myself that it was 1984 when it was made.<br /><br />The Soundtrack: Lensman has one of the most insane soundtracks that I\'ve heard, and this mad hysterical beat permeates every corner of the film. Lensman borrowed heavily on two western mistakes and managed to somewhat deal with the first one - the need to fill in every second of silence in a film with m

### 2.3.8 Data Summary 

After this quick exploratory data analysis we can conclude:
  - The dataset has only 2 columns:
    - `text`: the review of the movie.
    - `label`: distinguishes between positive and negative reviews.
  - There are no missing values.
  - Duplicate values were not checked in the cells above.
  - Both labels (Positive & Negative) are balanced.
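
The summary's point about duplicate values can be verified directly. Below is a minimal sketch in plain pandas on a small hypothetical frame (not the IMDB data); the pandas API on Spark aims to mirror these methods, but that is an assumption to confirm against your runtime version.

```python
import pandas as pd

# Hypothetical mini dataset; the second and fourth rows are identical
reviews = pd.DataFrame({
    "text": ["great movie", "awful plot", "loved it", "awful plot"],
    "label": [1, 0, 1, 0],
})

# duplicated() marks every repeat of an earlier row as True
duplicate_count = int(reviews.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
unique_reviews = reviews.drop_duplicates()

print(duplicate_count, len(unique_reviews))
```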

## 2.4 Data Processing for NLP

### 2.4.1 Remove Special Characters

In [0]:
# Remove every character that is not a letter or whitespace from 'text'
# and store the result in a new 'cleaned_text' column (raw string avoids an invalid-escape warning)
ps_df['cleaned_text'] = ps_df['text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))

# Display the first few rows of the DataFrame
ps_df.head()

| | text | label | cleaned_text |
|---|------|-------|--------------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 | I rented I AM CURIOUSYELLOW from my video stor... |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 | I Am Curious Yellow is a risible and pretentio... |
| 2 | If only to avoid making this type of film in t... | 0 | If only to avoid making this type of film in t... |
| 3 | This film was probably inspired by Godard's Ma... | 0 | This film was probably inspired by Godards Mas... |
| 4 | Oh, brother...after hearing about this ridicul... | 0 | Oh brotherafter hearing about this ridiculous ... |
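
One side effect of this pattern is worth noting: the reviews contain HTML line breaks (`<br />`), and removing only the symbols leaves the letters `br` behind, which is why tokens such as `disposalbr` and `br` show up in later outputs. The sketch below shows one possible way to handle this by replacing the tags with a space first; the helper name `clean_review` and the sample string are hypothetical, and this is not the approach used in the notebook.

```python
import re

raw = "Great film.<br /><br />Loved it!"

# Same pattern as the cell above: the markup symbols vanish but the letters "br" stay
naive = re.sub(r'[^a-zA-Z\s]', '', raw)

def clean_review(text):
    # Replace HTML line breaks with a space before dropping other characters
    text = re.sub(r'<br\s*/?>', ' ', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

cleaned = clean_review(raw)

print(naive)    # Great filmbr br Loved it
print(cleaned)  # Great film Loved it
```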


### 2.4.2 Convert to lowercase

In [0]:
# Convert text in 'cleaned_text' column to lowercase
ps_df['cleaned_text'] = ps_df['cleaned_text'].str.lower()

# Display the first few rows of the DataFrame
ps_df.head()


| | text | label | cleaned_text |
|---|------|-------|--------------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 | i rented i am curiousyellow from my video stor... |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 | i am curious yellow is a risible and pretentio... |
| 2 | If only to avoid making this type of film in t... | 0 | if only to avoid making this type of film in t... |
| 3 | This film was probably inspired by Godard's Ma... | 0 | this film was probably inspired by godards mas... |
| 4 | Oh, brother...after hearing about this ridicul... | 0 | oh brotherafter hearing about this ridiculous ... |


### 2.4.3 Remove Stop Words with nltk

In [0]:
# Ensure the NLTK stop word list is downloaded
nltk.download('stopwords')

# Build the stop word set once instead of rebuilding it on every call
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Keep only the words that are not in the stop word set
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    return " ".join(filtered_words)

# Apply the function to the 'cleaned_text' column
ps_df['cleaned_text'] = ps_df['cleaned_text'].apply(remove_stopwords)

# Show results
ps_df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | text | label | cleaned_text |
|---|------|-------|--------------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 | rented curiousyellow video store controversy s... |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 | curious yellow risible pretentious steaming pi... |
| 2 | If only to avoid making this type of film in t... | 0 | avoid making type film future film interesting... |
| 3 | This film was probably inspired by Godard's Ma... | 0 | film probably inspired godards masculin fminin... |
| 4 | Oh, brother...after hearing about this ridicul... | 0 | oh brotherafter hearing ridiculous film umptee... |
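
The order of the cleaning steps matters here. Apostrophes were stripped in 2.4.1, so a stop word stored with an apostrophe can no longer match its stripped form, which may explain why words like `dont` survive in later outputs. The sketch below uses a tiny hypothetical stop word set (not NLTK's list) to compare the two orders.

```python
import re

# Hypothetical, tiny stop word set for illustration; real lists are much longer
STOP_WORDS = {"i", "the", "don't", "this"}

def strip_punctuation(text):
    # Keep lowercase letters and whitespace only
    return re.sub(r"[^a-z\s]", "", text)

def drop_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

review = "i don't like this film"

# Order used in this notebook: punctuation first, then stop words
punct_first = drop_stop_words(strip_punctuation(review))

# Alternative order: stop words first, then punctuation
stop_first = strip_punctuation(drop_stop_words(review))

print(punct_first)  # dont like film
print(stop_first)   # like film
```

Neither order is universally right; which one fits depends on whether negations such as "don't" should be kept for sentiment analysis.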


### 2.4.4 Tokenize

In [0]:
# Initialize a tokenizer that matches runs of word characters (letters, digits, underscore)
tokenizer = RegexpTokenizer(r'\w+')

# Tokenize cleaned text and create a new column 'tokens'
ps_df['tokens'] = ps_df['cleaned_text'].apply(lambda x: tokenizer.tokenize(x))

# Display the first few rows of the DataFrame
ps_df.head()


| | text | label | cleaned_text | tokens |
|---|------|-------|--------------|--------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 | rented curiousyellow video store controversy s... | [rented, curiousyellow, video, store, controve... |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 | curious yellow risible pretentious steaming pi... | [curious, yellow, risible, pretentious, steami... |
| 2 | If only to avoid making this type of film in t... | 0 | avoid making type film future film interesting... | [avoid, making, type, film, future, film, inte... |
| 3 | This film was probably inspired by Godard's Ma... | 0 | film probably inspired godards masculin fminin... | [film, probably, inspired, godards, masculin, ... |
| 4 | Oh, brother...after hearing about this ridicul... | 0 | oh brotherafter hearing ridiculous film umptee... | [oh, brotherafter, hearing, ridiculous, film, ... |
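
For a simple pattern such as `\w+`, this kind of regular-expression tokenization amounts to collecting every match of the pattern. A standard-library sketch of the same idea follows; the helper name `regex_tokenize` is hypothetical, and treating it as equivalent to `RegexpTokenizer` for this pattern is an assumption.

```python
import re

def regex_tokenize(text, pattern=r'\w+'):
    # Return every non-overlapping match of the pattern, in order
    return re.findall(pattern, text)

tokens = regex_tokenize("oh brotherafter hearing ridiculous film")

# On uncleaned text, punctuation acts as a separator, so contractions are split
split_tokens = regex_tokenize("well-known, isn't it?")

print(tokens)
print(split_tokens)
```

Because the text was already stripped of punctuation in 2.4.1, the result is the same as splitting on whitespace; the pattern only makes a difference on raw text.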


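### 2.4.5 Stemming
- Stemming is a process in Natural Language Processing (NLP) that reduces words to their root form or stem. The stem is not necessarily a dictionary word; for example, 'curious' becomes 'curiou' in the output below.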

In [0]:
# Initialize the stemmer
stemmer = PorterStemmer()

# Stem each token in the 'tokens' column and create a new column 'stemmed_tokens'
ps_df['stemmed_tokens'] = ps_df['tokens'].apply(lambda x: [stemmer.stem(token) for token in x])

# Display the first few rows of the DataFrame
ps_df.head()


| | text | label | cleaned_text | tokens | stemmed_tokens |
|---|------|-------|--------------|--------|----------------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 | rented curiousyellow video store controversy s... | [rented, curiousyellow, video, store, controve... | [rent, curiousyellow, video, store, controvers... |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 | curious yellow risible pretentious steaming pi... | [curious, yellow, risible, pretentious, steami... | [curiou, yellow, risibl, pretenti, steam, pile... |
| 2 | If only to avoid making this type of film in t... | 0 | avoid making type film future film interesting... | [avoid, making, type, film, future, film, inte... | [avoid, make, type, film, futur, film, interes... |
| 3 | This film was probably inspired by Godard's Ma... | 0 | film probably inspired godards masculin fminin... | [film, probably, inspired, godards, masculin, ... | [film, probabl, inspir, godard, masculin, fmin... |
| 4 | Oh, brother...after hearing about this ridicul... | 0 | oh brotherafter hearing ridiculous film umptee... | [oh, brotherafter, hearing, ridiculous, film, ... | [oh, brotheraft, hear, ridicul, film, umpteen,... |


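### 2.4.6 Lemmatization

- Lemmatization is a more sophisticated technique than stemming. It aims to reduce words to their base or dictionary form, known as the lemma. Called without a part-of-speech tag, `WordNetLemmatizer.lemmatize` treats each token as a noun, which is why verb forms such as 'rented' are left unchanged in the output below.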

In [0]:
# Ensure the WordNet data is downloaded
nltk.download('wordnet')

# Initialize the lemmatizer (ps_df['tokens'] holds the token lists to lemmatize)
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# Apply lemmatization to each row in the 'tokens' column
ps_df['lemmatized_tokens'] = ps_df['tokens'].apply(lemmatize_tokens)

# Display the first few rows of the DataFrame
ps_df.head()


[nltk_data] Downloading package wordnet to /root/nltk_data...


| | text | label | cleaned_text | tokens | stemmed_tokens | lemmatized_tokens |
|---|------|-------|--------------|--------|----------------|-------------------|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 | rented curiousyellow video store controversy s... | [rented, curiousyellow, video, store, controve... | [rent, curiousyellow, video, store, controvers... | [rented, curiousyellow, video, store, controve... |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 | curious yellow risible pretentious steaming pi... | [curious, yellow, risible, pretentious, steami... | [curiou, yellow, risibl, pretenti, steam, pile... | [curious, yellow, risible, pretentious, steami... |
| 2 | If only to avoid making this type of film in t... | 0 | avoid making type film future film interesting... | [avoid, making, type, film, future, film, inte... | [avoid, make, type, film, futur, film, interes... | [avoid, making, type, film, future, film, inte... |
| 3 | This film was probably inspired by Godard's Ma... | 0 | film probably inspired godards masculin fminin... | [film, probably, inspired, godards, masculin, ... | [film, probabl, inspir, godard, masculin, fmin... | [film, probably, inspired, godard, masculin, f... |
| 4 | Oh, brother...after hearing about this ridicul... | 0 | oh brotherafter hearing ridiculous film umptee... | [oh, brotherafter, hearing, ridiculous, film, ... | [oh, brotheraft, hear, ridicul, film, umpteen,... | [oh, brotherafter, hearing, ridiculous, film, ... |
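
The contrast between the two techniques can be sketched with toy versions. The code below is deliberately simplified and is neither the Porter algorithm nor WordNet: `toy_stem` strips a few hard-coded suffixes, while `toy_lemmatize` looks words up in a small hand-written table standing in for a real dictionary. All names and the word list are hypothetical.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just the general idea
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least three characters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a hand-written lookup table standing in for a real dictionary
LEMMAS = {"movies": "movie", "stories": "story"}

def toy_lemmatize(word):
    # Words missing from the table are returned unchanged
    return LEMMAS.get(word, word)

words = ["movies", "stories", "rented"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['movi', 'stori', 'rent']
print(lemmas)  # ['movie', 'story', 'rented']
```

Rule-based stripping is cheap but can produce non-words, whereas a dictionary lookup returns real words but only for the entries and word classes it knows about.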


### 2.4.7 Count the contents of each column

- text
- cleaned_text
- tokens
- stemmed_tokens
- lemmatized_tokens

In [0]:
# Calculate the length of each review in characters and create the 'text_count' column
ps_df['text_count'] = ps_df['text'].apply(len)

# Calculate the length of the cleaned text in characters and create the 'cleaned_text_count' column
ps_df['cleaned_text_count'] = ps_df['cleaned_text'].apply(len)

# Calculate the number of tokens in each review and create the 'tokens_count' column
ps_df['tokens_count'] = ps_df['tokens'].apply(len)

# Calculate the number of stemmed tokens in each review and create the 'stemmed_tokens_count' column
ps_df['stemmed_tokens_count'] = ps_df['stemmed_tokens'].apply(len)

# Calculate the number of lemmatized tokens in each review and create the 'lemmatized_tokens_count' column
ps_df['lemmatized_tokens_count'] = ps_df['lemmatized_tokens'].apply(len)

# Display the first few rows of the updated DataFrame
ps_df.head()

Unnamed: 0,text,label,cleaned_text,tokens,stemmed_tokens,lemmatized_tokens,review_length,cleaned_text_length,tokens_length,stemmed_tokens_length,lemmatized_tokens_length,text_count,cleaned_text_count,tokens_count,stemmed_tokens_count,lemmatized_tokens_count
0,I rented I AM CURIOUS-YELLOW from my video sto...,0,rented curiousyellow video store controversy s...,"[rented, curiousyellow, video, store, controve...","[rent, curiousyellow, video, store, controvers...","[rented, curiousyellow, video, store, controve...",1640,1061,150,150,150,1640,1061,150,150,150
1,"""I Am Curious: Yellow"" is a risible and preten...",0,curious yellow risible pretentious steaming pi...,"[curious, yellow, risible, pretentious, steami...","[curiou, yellow, risibl, pretenti, steam, pile...","[curious, yellow, risible, pretentious, steami...",1294,871,120,120,120,1294,871,120,120,120
2,If only to avoid making this type of film in t...,0,avoid making type film future film interesting...,"[avoid, making, type, film, future, film, inte...","[avoid, make, type, film, futur, film, interes...","[avoid, making, type, film, future, film, inte...",528,346,52,52,52,528,346,52,52,52
3,This film was probably inspired by Godard's Ma...,0,film probably inspired godards masculin fminin...,"[film, probably, inspired, godards, masculin, ...","[film, probabl, inspir, godard, masculin, fmin...","[film, probably, inspired, godard, masculin, f...",706,428,58,58,58,706,428,58,58,58
4,"Oh, brother...after hearing about this ridicul...",0,oh brotherafter hearing ridiculous film umptee...,"[oh, brotherafter, hearing, ridiculous, film, ...","[oh, brotheraft, hear, ridicul, film, umpteen,...","[oh, brotherafter, hearing, ridiculous, film, ...",1814,1185,172,172,172,1814,1185,172,172,172


## 2.5 Create SQL Table with the processed data using Spark Dataframe

### 2.5.1 Load Pandas-on-Spark Dataframe to a Spark Dataframe


In [0]:
# Convert the pandas-on-Spark DataFrame ps_df to a Spark DataFrame ps_spark_df
ps_spark_df = ps_df.to_spark()

# Check the type of the converted Spark DataFrame
type(ps_spark_df)

Out[119]: pyspark.sql.dataframe.DataFrame

### 2.5.2 Create SQL Table from Spark Dataframe


In [0]:
# Register the Spark DataFrame created above as a temporary view
# named "imdb_prepared" in the Spark session
ps_spark_df.createOrReplaceTempView("imdb_prepared")


### 2.5.3 Query SQL Table

In [0]:
# Execute a SQL query on the "imdb_prepared" temporary view to select rows where
# text_count (the review length in characters) is less than 500, and limit the result to 5 rows
sql_result = spark.sql("SELECT * FROM imdb_prepared WHERE text_count < 500 LIMIT 5")

# Display the result with the Databricks notebook's built-in display function
display(sql_result)

text,label,cleaned_text,tokens,stemmed_tokens,lemmatized_tokens,review_length,cleaned_text_length,tokens_length,stemmed_tokens_length,lemmatized_tokens_length,text_count,cleaned_text_count,tokens_count,stemmed_tokens_count,lemmatized_tokens_count
"My interest in Dorothy Stratten caused me to purchase this video. Although it had great actors/actresses, there were just too many subplots going on to retain interest. Plus it just wasn't that interesting. Dialogue was stiff and confusing and the story just flipped around too much to be believable. I was pretty disappointed in what I believe was one of Audrey Hepburn's last movies. I'll always love John Ritter best in slapstick. He was just too pathetic here.",0,interest dorothy stratten caused purchase video although great actorsactresses many subplots going retain interest plus wasnt interesting dialogue stiff confusing story flipped around much believable pretty disappointed believe one audrey hepburns last movies ill always love john ritter best slapstick pathetic,"List(interest, dorothy, stratten, caused, purchase, video, although, great, actorsactresses, many, subplots, going, retain, interest, plus, wasnt, interesting, dialogue, stiff, confusing, story, flipped, around, much, believable, pretty, disappointed, believe, one, audrey, hepburns, last, movies, ill, always, love, john, ritter, best, slapstick, pathetic)","List(interest, dorothi, stratten, caus, purchas, video, although, great, actorsactress, mani, subplot, go, retain, interest, plu, wasnt, interest, dialogu, stiff, confus, stori, flip, around, much, believ, pretti, disappoint, believ, one, audrey, hepburn, last, movi, ill, alway, love, john, ritter, best, slapstick, pathet)","List(interest, dorothy, stratten, caused, purchase, video, although, great, actorsactresses, many, subplots, going, retain, interest, plus, wasnt, interesting, dialogue, stiff, confusing, story, flipped, around, much, believable, pretty, disappointed, believe, one, audrey, hepburn, last, movie, ill, always, love, john, ritter, best, slapstick, pathetic)",464,311,41,41,41,464,311,41,41,41
"I think I will make a movie next weekend. Oh wait, I'm working..oh I'm sure I can fit it in. It looks like whoever made this film fit it in. I hope the makers of this crap have day jobs because this film sucked!!! It looks like someones home movie and I don't think more than $100 was spent making it!!! Total crap!!! Who let's this stuff be released?!?!?!",0,think make movie next weekend oh wait im workingoh im sure fit looks like whoever made film fit hope makers crap day jobs film sucked looks like someones home movie dont think spent making total crap lets stuff released,"List(think, make, movie, next, weekend, oh, wait, im, workingoh, im, sure, fit, looks, like, whoever, made, film, fit, hope, makers, crap, day, jobs, film, sucked, looks, like, someones, home, movie, dont, think, spent, making, total, crap, lets, stuff, released)","List(think, make, movi, next, weekend, oh, wait, im, workingoh, im, sure, fit, look, like, whoever, made, film, fit, hope, maker, crap, day, job, film, suck, look, like, someon, home, movi, dont, think, spent, make, total, crap, let, stuff, releas)","List(think, make, movie, next, weekend, oh, wait, im, workingoh, im, sure, fit, look, like, whoever, made, film, fit, hope, maker, crap, day, job, film, sucked, look, like, someone, home, movie, dont, think, spent, making, total, crap, let, stuff, released)",356,219,39,39,39,356,219,39,39,39
Ned aKelly is such an important story to Australians but this movie is awful. It's an Australian story yet it seems like it was set in America. Also Ned was an Australian yet he has an Irish accent...it is the worst film I have seen in a long time,0,ned akelly important story australians movie awful australian story yet seems like set america also ned australian yet irish accentit worst film seen long time,"List(ned, akelly, important, story, australians, movie, awful, australian, story, yet, seems, like, set, america, also, ned, australian, yet, irish, accentit, worst, film, seen, long, time)","List(ned, akelli, import, stori, australian, movi, aw, australian, stori, yet, seem, like, set, america, also, ned, australian, yet, irish, accentit, worst, film, seen, long, time)","List(ned, akelly, important, story, australian, movie, awful, australian, story, yet, seems, like, set, america, also, ned, australian, yet, irish, accentit, worst, film, seen, long, time)",247,159,25,25,25,247,159,25,25,25
Protocol is an implausible movie whose only saving grace is that it stars Goldie Hawn along with a good cast of supporting actors. The story revolves around a ditzy cocktail waitress who becomes famous after inadvertently saving the life of an Arab dignitary. The story goes downhill halfway through the movie and Goldie's charm just doesn't save this movie. Unless you are a Goldie Hawn fan don't go out of your way to see this film.,0,protocol implausible movie whose saving grace stars goldie hawn along good cast supporting actors story revolves around ditzy cocktail waitress becomes famous inadvertently saving life arab dignitary story goes downhill halfway movie goldies charm doesnt save movie unless goldie hawn fan dont go way see film,"List(protocol, implausible, movie, whose, saving, grace, stars, goldie, hawn, along, good, cast, supporting, actors, story, revolves, around, ditzy, cocktail, waitress, becomes, famous, inadvertently, saving, life, arab, dignitary, story, goes, downhill, halfway, movie, goldies, charm, doesnt, save, movie, unless, goldie, hawn, fan, dont, go, way, see, film)","List(protocol, implaus, movi, whose, save, grace, star, goldi, hawn, along, good, cast, support, actor, stori, revolv, around, ditzi, cocktail, waitress, becom, famou, inadvert, save, life, arab, dignitari, stori, goe, downhil, halfway, movi, goldi, charm, doesnt, save, movi, unless, goldi, hawn, fan, dont, go, way, see, film)","List(protocol, implausible, movie, whose, saving, grace, star, goldie, hawn, along, good, cast, supporting, actor, story, revolves, around, ditzy, cocktail, waitress, becomes, famous, inadvertently, saving, life, arab, dignitary, story, go, downhill, halfway, movie, goldies, charm, doesnt, save, movie, unless, goldie, hawn, fan, dont, go, way, see, film)",434,309,46,46,46,434,309,46,46,46
Outlandish premise that rates low on plausibility and unfortunately also struggles feebly to raise laughs or interest. Only Hawn's well-known charm allows it to skate by on very thin ice. Goldie's gotta be a contender for an actress who's done so much in her career with very little quality material at her disposal...,0,outlandish premise rates low plausibility unfortunately also struggles feebly raise laughs interest hawns wellknown charm allows skate thin ice goldies gotta contender actress whos done much career little quality material disposalbr br,"List(outlandish, premise, rates, low, plausibility, unfortunately, also, struggles, feebly, raise, laughs, interest, hawns, wellknown, charm, allows, skate, thin, ice, goldies, gotta, contender, actress, whos, done, much, career, little, quality, material, disposalbr, br)","List(outlandish, premis, rate, low, plausibl, unfortun, also, struggl, feebli, rais, laugh, interest, hawn, wellknown, charm, allow, skate, thin, ice, goldi, gotta, contend, actress, who, done, much, career, littl, qualiti, materi, disposalbr, br)","List(outlandish, premise, rate, low, plausibility, unfortunately, also, struggle, feebly, raise, laugh, interest, hawns, wellknown, charm, allows, skate, thin, ice, goldies, gotta, contender, actress, who, done, much, career, little, quality, material, disposalbr, br)",330,235,32,32,32,330,235,32,32,32


## 2.6 Analysis and Visualizations through SQL

In [0]:
%sql
 
-- Refresh the "imdb_prepared" table/view to ensure it reflects recent changes
REFRESH TABLE imdb_prepared;

-- Select up to 5 rows from the "imdb_prepared" table/view where text_count is less than 500
SELECT * FROM imdb_prepared WHERE text_count < 500 LIMIT 5;

text,label,cleaned_text,tokens,stemmed_tokens,lemmatized_tokens,review_length,cleaned_text_length,tokens_length,stemmed_tokens_length,lemmatized_tokens_length,text_count,cleaned_text_count,tokens_count,stemmed_tokens_count,lemmatized_tokens_count
"My interest in Dorothy Stratten caused me to purchase this video. Although it had great actors/actresses, there were just too many subplots going on to retain interest. Plus it just wasn't that interesting. Dialogue was stiff and confusing and the story just flipped around too much to be believable. I was pretty disappointed in what I believe was one of Audrey Hepburn's last movies. I'll always love John Ritter best in slapstick. He was just too pathetic here.",0,interest dorothy stratten caused purchase video although great actorsactresses many subplots going retain interest plus wasnt interesting dialogue stiff confusing story flipped around much believable pretty disappointed believe one audrey hepburns last movies ill always love john ritter best slapstick pathetic,"List(interest, dorothy, stratten, caused, purchase, video, although, great, actorsactresses, many, subplots, going, retain, interest, plus, wasnt, interesting, dialogue, stiff, confusing, story, flipped, around, much, believable, pretty, disappointed, believe, one, audrey, hepburns, last, movies, ill, always, love, john, ritter, best, slapstick, pathetic)","List(interest, dorothi, stratten, caus, purchas, video, although, great, actorsactress, mani, subplot, go, retain, interest, plu, wasnt, interest, dialogu, stiff, confus, stori, flip, around, much, believ, pretti, disappoint, believ, one, audrey, hepburn, last, movi, ill, alway, love, john, ritter, best, slapstick, pathet)","List(interest, dorothy, stratten, caused, purchase, video, although, great, actorsactresses, many, subplots, going, retain, interest, plus, wasnt, interesting, dialogue, stiff, confusing, story, flipped, around, much, believable, pretty, disappointed, believe, one, audrey, hepburn, last, movie, ill, always, love, john, ritter, best, slapstick, pathetic)",464,311,41,41,41,464,311,41,41,41
"I think I will make a movie next weekend. Oh wait, I'm working..oh I'm sure I can fit it in. It looks like whoever made this film fit it in. I hope the makers of this crap have day jobs because this film sucked!!! It looks like someones home movie and I don't think more than $100 was spent making it!!! Total crap!!! Who let's this stuff be released?!?!?!",0,think make movie next weekend oh wait im workingoh im sure fit looks like whoever made film fit hope makers crap day jobs film sucked looks like someones home movie dont think spent making total crap lets stuff released,"List(think, make, movie, next, weekend, oh, wait, im, workingoh, im, sure, fit, looks, like, whoever, made, film, fit, hope, makers, crap, day, jobs, film, sucked, looks, like, someones, home, movie, dont, think, spent, making, total, crap, lets, stuff, released)","List(think, make, movi, next, weekend, oh, wait, im, workingoh, im, sure, fit, look, like, whoever, made, film, fit, hope, maker, crap, day, job, film, suck, look, like, someon, home, movi, dont, think, spent, make, total, crap, let, stuff, releas)","List(think, make, movie, next, weekend, oh, wait, im, workingoh, im, sure, fit, look, like, whoever, made, film, fit, hope, maker, crap, day, job, film, sucked, look, like, someone, home, movie, dont, think, spent, making, total, crap, let, stuff, released)",356,219,39,39,39,356,219,39,39,39
Ned aKelly is such an important story to Australians but this movie is awful. It's an Australian story yet it seems like it was set in America. Also Ned was an Australian yet he has an Irish accent...it is the worst film I have seen in a long time,0,ned akelly important story australians movie awful australian story yet seems like set america also ned australian yet irish accentit worst film seen long time,"List(ned, akelly, important, story, australians, movie, awful, australian, story, yet, seems, like, set, america, also, ned, australian, yet, irish, accentit, worst, film, seen, long, time)","List(ned, akelli, import, stori, australian, movi, aw, australian, stori, yet, seem, like, set, america, also, ned, australian, yet, irish, accentit, worst, film, seen, long, time)","List(ned, akelly, important, story, australian, movie, awful, australian, story, yet, seems, like, set, america, also, ned, australian, yet, irish, accentit, worst, film, seen, long, time)",247,159,25,25,25,247,159,25,25,25
Protocol is an implausible movie whose only saving grace is that it stars Goldie Hawn along with a good cast of supporting actors. The story revolves around a ditzy cocktail waitress who becomes famous after inadvertently saving the life of an Arab dignitary. The story goes downhill halfway through the movie and Goldie's charm just doesn't save this movie. Unless you are a Goldie Hawn fan don't go out of your way to see this film.,0,protocol implausible movie whose saving grace stars goldie hawn along good cast supporting actors story revolves around ditzy cocktail waitress becomes famous inadvertently saving life arab dignitary story goes downhill halfway movie goldies charm doesnt save movie unless goldie hawn fan dont go way see film,"List(protocol, implausible, movie, whose, saving, grace, stars, goldie, hawn, along, good, cast, supporting, actors, story, revolves, around, ditzy, cocktail, waitress, becomes, famous, inadvertently, saving, life, arab, dignitary, story, goes, downhill, halfway, movie, goldies, charm, doesnt, save, movie, unless, goldie, hawn, fan, dont, go, way, see, film)","List(protocol, implaus, movi, whose, save, grace, star, goldi, hawn, along, good, cast, support, actor, stori, revolv, around, ditzi, cocktail, waitress, becom, famou, inadvert, save, life, arab, dignitari, stori, goe, downhil, halfway, movi, goldi, charm, doesnt, save, movi, unless, goldi, hawn, fan, dont, go, way, see, film)","List(protocol, implausible, movie, whose, saving, grace, star, goldie, hawn, along, good, cast, supporting, actor, story, revolves, around, ditzy, cocktail, waitress, becomes, famous, inadvertently, saving, life, arab, dignitary, story, go, downhill, halfway, movie, goldies, charm, doesnt, save, movie, unless, goldie, hawn, fan, dont, go, way, see, film)",434,309,46,46,46,434,309,46,46,46
Outlandish premise that rates low on plausibility and unfortunately also struggles feebly to raise laughs or interest. Only Hawn's well-known charm allows it to skate by on very thin ice. Goldie's gotta be a contender for an actress who's done so much in her career with very little quality material at her disposal...,0,outlandish premise rates low plausibility unfortunately also struggles feebly raise laughs interest hawns wellknown charm allows skate thin ice goldies gotta contender actress whos done much career little quality material disposalbr br,"List(outlandish, premise, rates, low, plausibility, unfortunately, also, struggles, feebly, raise, laughs, interest, hawns, wellknown, charm, allows, skate, thin, ice, goldies, gotta, contender, actress, whos, done, much, career, little, quality, material, disposalbr, br)","List(outlandish, premis, rate, low, plausibl, unfortun, also, struggl, feebli, rais, laugh, interest, hawn, wellknown, charm, allow, skate, thin, ice, goldi, gotta, contend, actress, who, done, much, career, littl, qualiti, materi, disposalbr, br)","List(outlandish, premise, rate, low, plausibility, unfortunately, also, struggle, feebly, raise, laugh, interest, hawns, wellknown, charm, allows, skate, thin, ice, goldies, gotta, contender, actress, who, done, much, career, little, quality, material, disposalbr, br)",330,235,32,32,32,330,235,32,32,32
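
Beyond row-level selection, an aggregate query is a natural next step for analysis. The sketch below runs one against an in-memory SQLite table that stands in for the `imdb_prepared` view, so it can be tried outside Databricks. The rows are hypothetical, SQLite is a swapped-in engine, and the statement uses only standard SQL (`COUNT`, `AVG`, `GROUP BY`), so it would be expected, but is not confirmed here, to run unchanged in a `%sql` cell.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temporary view
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE imdb_prepared (label INTEGER, text_count INTEGER)")

# Hypothetical rows; not taken from the real dataset
rows = [(0, 400), (0, 600), (1, 300), (1, 500), (1, 700)]
conn.executemany("INSERT INTO imdb_prepared VALUES (?, ?)", rows)

# Number of reviews and average review length per label
query = """
    SELECT label,
           COUNT(*)        AS reviews,
           AVG(text_count) AS avg_text_count
    FROM imdb_prepared
    GROUP BY label
    ORDER BY label
"""
summary = conn.execute(query).fetchall()
conn.close()

print(summary)  # [(0, 2, 500.0), (1, 3, 500.0)]
```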


# 3. Conclusion

### Pandas
- **Use Cases**: Small to medium-sized datasets, local data analysis, quick prototyping.
- **Advantages**: Easy to use, rich functionality, excellent for in-memory operations.
- **Disadvantages**: Not suitable for very large datasets due to memory constraints.

### PySpark
- **Use Cases**: Large datasets, distributed data processing, big data analytics.
- **Advantages**: Scalable, can handle large datasets, integrates well with Hadoop.
- **Disadvantages**: More complex than Pandas, requires a Spark cluster.

### SQL
- **Use Cases**: Data querying, reporting, integration with BI tools.
- **Advantages**: Familiarity for users with SQL background, powerful for data retrieval and manipulation.
- **Disadvantages**: Limited to SQL operations, may require additional steps for complex data manipulations.