# Data Processing in Databricks, leveraging Pandas, PySpark, and SQL

## Instructor: Marcelino Mayorga Quesada






# 1. Summary

## 1.1 Data Processing

- Data processing is a series of operations that convert raw data into meaningful information.
- It is essential in Data Engineering for Prescriptive, Descriptive, and Exploratory Analysis.
- Processed data enables persistent storage of the transformed data, analysis, and machine learning.

## 1.2 Operations

These operations are applied as needs and objectives dictate:

- Cleaning:
  - Removing duplicates
  - Imputing or deleting missing values
  - Correcting errors and inconsistencies
- Integration:
  - ETL (Extract, Transform, Load)
  - Merging and joining data for data warehousing
  - Augmentation
- Transformation:
  - Normalization and Standardization
  - Aggregation (Summing, Averaging)
  - Pivoting tables
  - Encoding categorical values
- Reduction:
  - Dimensionality Reduction: PCA, t-SNE
  - Feature Selection & Extraction
  - Sampling
  - Compression
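
A few of the operations listed above can be sketched with Pandas on a toy table. The data, column names, and the choice of mean imputation and min-max normalization are assumptions made for illustration only; they are not part of the lab.

```python
import pandas as pd

# Toy data (hypothetical): one duplicated row and one missing score.
raw = pd.DataFrame(
    {
        "city": ["Lima", "Lima", "Quito", "Quito", "Bogota"],
        "score": [10.0, 10.0, None, 30.0, 20.0],
    }
)

# Cleaning: drop exact duplicate rows, then impute missing scores with the mean.
clean = raw.drop_duplicates().reset_index(drop=True)
clean["score"] = clean["score"].fillna(clean["score"].mean())

# Transformation: min-max normalization to the [0, 1] range.
score_min, score_max = clean["score"].min(), clean["score"].max()
clean["score_norm"] = (clean["score"] - score_min) / (score_max - score_min)

# Transformation: aggregation (average score per city).
avg_by_city = clean.groupby("city")["score"].mean()

# Reduction: reproducible 50% sample of the rows.
sample = clean.sample(frac=0.5, random_state=0)
```

Which imputation or scaling method is appropriate depends on the analysis; the mean and min-max choices here are only one option.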



## 1.3 Databricks

- Unified:
  - Data Intelligence Platform
  - Collaborative Workspace
  - Data Lake Integration with AWS, Azure, and GCP
- Open Source Projects:
  - Optimized Apache Spark
  - MLflow
  - Delta Lake
- Scalable:
  - Automatic storage optimization for high performance

## 1.4 Tools

![Tools](https://example.com/path/to/image.jpg)





# 2. Lab

In this notebook, we will explore how to use Pandas, PySpark, and SQL for data processing within Databricks.

![Diagram](https://example.com/path/to/image.jpg)


## 2.1 Data Source
We'll use a Hugging Face dataset for this laboratory. Below are the details:

| Attribute | Value |
|-----------|-------|
| Source | Hugging Face |
| Dataset | [imdb](https://huggingface.co/datasets/stanfordnlp/imdb) |
| Columns (2) | text, label |
| Purpose | Binary Sentiment Classification |
| Rows | 25000 (train split) |





## 2.1 Install required libraries

Let's install the necessary libraries.

In [0]:
!pip install datasets nltk



## 2.2 Import necessary libraries

In [0]:
import pandas as pd
import pyspark.pandas as ps
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from datasets import load_dataset


## 2.3 Data Ingest


### 2.3.1 Load Dataset in Memory

We'll leverage Hugging Face's datasets library to retrieve the IMDB dataset. This data is not persisted and will disappear after the cluster terminates or restarts.

Notice that the dataset's type is 'DatasetDict', which supports only a limited set of operations.


In [0]:
dataset = load_dataset('imdb')
type(dataset)

Out[42]: datasets.dataset_dict.DatasetDict

### 2.3.2 Load the Dataset into a Pandas DataFrame from Memory

We'll load the dataset into a Pandas DataFrame to unlock the full set of data manipulation features. Pandas is designed to work on a single node, and the data used in this example is considered low volume.

Notice that pd_df is a Pandas DataFrame.

In [0]:
pd_df = dataset['train'].to_pandas()
type(pd_df)

Out[43]: pandas.core.frame.DataFrame

### 2.3.3 Load the Pandas DataFrame into a Pandas-on-Spark DataFrame

Now we'll load the Pandas DataFrame into a pandas-on-Spark DataFrame, which lets us keep the familiar Pandas interface while leveraging the distributed nature of Spark.


In [0]:
ps_df = ps.from_pandas(pd_df)
type(ps_df)


Out[44]: pyspark.pandas.frame.DataFrame

## 2.4 Differences Between Pandas and Spark


| Pandas | PySpark |
|-------|-------|
|DataFrames|DataFrames|
|Low Volume Data| High Volume Data|
|Single-Node Computing | Distributed Computing|
|Eager Execution| Lazy Evaluation|
|N/A| Fault Tolerance|
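
The eager-versus-lazy row of the table can be illustrated with plain Python. This is an analogy only, not Spark's engine: a list comprehension does its work immediately, much as Pandas does, while a generator defers the work until a value is requested, similar in spirit to Spark transformations waiting for an action. The helper `double` and the counters are hypothetical names.

```python
# Analogy only: eager list versus lazy generator.
calls = []

def double(x):
    calls.append(x)  # record every time the function actually runs
    return x * 2

eager = [double(x) for x in range(3)]    # all three calls happen here
calls_after_eager = len(calls)

lazy = (double(x) for x in range(3))     # nothing is computed yet
calls_after_lazy_definition = len(calls)

first = next(lazy)                       # computes only the first element
calls_after_first_request = len(calls)
```

Deferring the work is what gives an engine the chance to plan and optimize a whole chain of steps before running any of them.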




## 2.5 Quick Exploratory Analysis

### 2.5.1 Data's Shape

The data contains 25,000 rows and 2 columns.

In [0]:
ps_df.shape

Out[45]: (25000, 2)

### 2.5.2 Column's Data Types

|Column|Type|
|------|----|
|text|object|
|label|int|

In [0]:
ps_df.dtypes

Out[46]: text     object
label     int64
dtype: object

### 2.5.3 Summary Statistics

In [0]:
ps_df.describe()

| |label|
|---|---|
|count|25000.0|
|mean|0.5|
|std|0.50001|
|min|0.0|
|25%|0.0|
|50%|0.0|
|75%|1.0|
|max|1.0|


### 2.5.4 Missing Values

The dataset has no missing values.

In [0]:
ps_df.isnull().sum()

Out[48]: text     0
label    0
dtype: int64

### 2.5.5 Positive / Negative Review Ratio

The dataset is balanced between the two labels, positive and negative, with 12,500 reviews each.

In [0]:
ps_df['label'].value_counts()

Out[49]: 0    12500
1    12500
Name: label, dtype: int64

### 2.5.6 Samples

In [0]:
ps_df.sample(frac=0.0002) # Fraction of rows to return (0.0002 = 0.02%)

| |text|label|
|---|---|---|
|15118|I wasn't sure when I heard about this coming o...|1|
|16950|This sweeping drama has it all: top notch acti...|1|
|24980|I was pleased to see that she had black hair! ...|1|


### 2.5.7 A full sample

In [0]:
# Example of text from a review
ps_df['text'][17370]

Out[51]: 'People don\'t seem to be giving Lensman enough credit where its due. A few issues have been overlooked which are key to understanding the Lensman experience.<br /><br />The Year: For the year it was made in (1984) Lensman features some of the most stunning effects I\'ve ever seen. As a person who watches a lot of early 80\'s animation Lensman is unique in it\'s use of what appears to be computer-generated imagery at a time when computers were extremely primitive. Kim\'s battle against the geometric cutter pods in the laser maze can be taken as an excellent example of this. Every time I watch that I have to keep repeating to myself that it was 1984 when it was made.<br /><br />The Soundtrack: Lensman has one of the most insane soundtracks that I\'ve heard, and this mad hysterical beat permeates every corner of the film. Lensman borrowed heavily on two western mistakes and managed to somewhat deal with the first one - the need to fill in every second of silence in a film with m

## 2.6 Data Summary 

After this quick exploratory data analysis we can conclude:
  - The dataset has only 2 columns: the review text and a label that distinguishes positive from negative reviews.
  - There are no missing values.
  - There are no duplicate values.
  - Both Labels (Positive & Negative) are balanced.
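
The cells above do not include an explicit duplicate check, so the sketch below shows one way the duplicate and balance statements could be verified with Pandas. The small `reviews` frame is a hypothetical stand-in with the same two columns as the lab data; it is not the IMDB dataset.

```python
import pandas as pd

# Hypothetical stand-in for the lab data; same two columns, toy values.
reviews = pd.DataFrame(
    {
        "text": ["great film", "awful plot", "great film", "fine acting"],
        "label": [1, 0, 1, 1],
    }
)

# Count rows that repeat an earlier (text, label) pair exactly.
duplicate_rows = int(reviews.duplicated().sum())

# Count rows whose review text repeats, regardless of label.
duplicate_texts = int(reviews.duplicated(subset=["text"]).sum())

# Label balance expressed as proportions rather than raw counts.
label_share = reviews["label"].value_counts(normalize=True)
```

On the lab data, the same calls could be run against pd_df; a result of zero duplicates would confirm the summary.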

## 3. Data Processing for NLP

### 3.1 Remove Special Characters

In [0]:
import re
# Keep only letters and whitespace; the raw string avoids an invalid-escape warning for \s.
ps_df['cleaned_text'] = ps_df['text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
ps_df.head()

Unnamed: 0,text,label,cleaned_text
0,I rented I AM CURIOUS-YELLOW from my video sto...,0,I rented I AM CURIOUSYELLOW from my video stor...
1,"""I Am Curious: Yellow"" is a risible and preten...",0,I Am Curious Yellow is a risible and pretentio...
2,If only to avoid making this type of film in t...,0,If only to avoid making this type of film in t...
3,This film was probably inspired by Godard's Ma...,0,This film was probably inspired by Godards Mas...
4,"Oh, brother...after hearing about this ridicul...",0,Oh brotherafter hearing about this ridiculous ...
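
The character filter in this step removes the angle brackets and slash of an HTML tag but keeps its letters, so `<br />` turns into a stray `br` that later shows up in outputs such as "disposalbr br". A minimal sketch of one possible remedy, assuming it is acceptable to replace tags with a space before filtering; `clean_review` and the sample sentence are hypothetical and not part of the lab.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")               # HTML tags such as <br />
NON_LETTER_PATTERN = re.compile(r"[^a-zA-Z\s]")    # same character class as this step

def clean_review(text: str) -> str:
    """Replace HTML tags with a space, then drop non-letter characters."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return NON_LETTER_PATTERN.sub("", without_tags)

raw = "Only Hawn's charm saves it...<br /><br />Skip it!"
original_style = NON_LETTER_PATTERN.sub("", raw)   # filter only, as in this step
tag_aware = clean_review(raw)                      # tags removed first
```

Replacing the tag with a space rather than an empty string keeps the words on either side from being glued together.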


### 3.2 Convert to Lowercase

In [0]:
ps_df['cleaned_text'] = ps_df['cleaned_text'].str.lower()
ps_df.head()

Unnamed: 0,text,label,cleaned_text
0,I rented I AM CURIOUS-YELLOW from my video sto...,0,i rented i am curiousyellow from my video stor...
1,"""I Am Curious: Yellow"" is a risible and preten...",0,i am curious yellow is a risible and pretentio...
2,If only to avoid making this type of film in t...,0,if only to avoid making this type of film in t...
3,This film was probably inspired by Godard's Ma...,0,this film was probably inspired by godards mas...
4,"Oh, brother...after hearing about this ridicul...",0,oh brotherafter hearing about this ridiculous ...


### 3.3 Remove Stop Words
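
The notebook leaves this step without code, and the tokens shown in later outputs still contain words such as "i" and "am". A minimal sketch of what stop-word removal could look like, assuming a tiny hand-written stop list; `STOP_WORDS` and `remove_stop_words` are hypothetical names, and a real pipeline would more likely load a curated list.

```python
# Tiny hand-written stop list for illustration only.
STOP_WORDS = {"i", "am", "a", "an", "the", "is", "and", "from", "my", "to", "of"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]

# Whitespace split stands in for the tokenizer used in the next step.
tokens = "i rented i am curiousyellow from my video store".split()
filtered = remove_stop_words(tokens)
```

Because the function works on a list of tokens, it would fit most naturally after the tokenization step rather than before it.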

### 3.4 Tokenize

In [0]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
ps_df['tokens'] = ps_df['cleaned_text'].apply(lambda x: tokenizer.tokenize(x))
#ps_df.sample(frac=0.0002) # Percentage
ps_df.head()



Unnamed: 0,text,label,cleaned_text,tokens
0,I rented I AM CURIOUS-YELLOW from my video sto...,0,i rented i am curiousyellow from my video stor...,"[i, rented, i, am, curiousyellow, from, my, vi..."
1,"""I Am Curious: Yellow"" is a risible and preten...",0,i am curious yellow is a risible and pretentio...,"[i, am, curious, yellow, is, a, risible, and, ..."
2,If only to avoid making this type of film in t...,0,if only to avoid making this type of film in t...,"[if, only, to, avoid, making, this, type, of, ..."
3,This film was probably inspired by Godard's Ma...,0,this film was probably inspired by godards mas...,"[this, film, was, probably, inspired, by, goda..."
4,"Oh, brother...after hearing about this ridicul...",0,oh brotherafter hearing about this ridiculous ...,"[oh, brotherafter, hearing, about, this, ridic..."
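
To inspect what the `\w+` pattern does without NLTK installed, the standard library can be used. With its default settings, RegexpTokenizer returns the non-overlapping matches of its pattern, so the sketch below is expected to yield the same tokens; treat that equivalence as an assumption to verify rather than a guarantee. `tokenize` is a hypothetical helper.

```python
import re

WORD_PATTERN = re.compile(r"\w+")  # runs of letters, digits, and underscores

def tokenize(text: str) -> list:
    """Return every non-overlapping match of WORD_PATTERN, in order."""
    return WORD_PATTERN.findall(text)

# Repeated spaces are skipped because they never match the pattern.
tokens = tokenize("oh brotherafter hearing  about this ridiculous film")
```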


### 3.5 Stemming or Lemmatization

In [0]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
ps_df['stemmed_tokens'] = ps_df['tokens'].apply(lambda x: [stemmer.stem(token) for token in x])
ps_df.head()

Unnamed: 0,text,label,cleaned_text,tokens,stemmed_tokens
0,I rented I AM CURIOUS-YELLOW from my video sto...,0,i rented i am curiousyellow from my video stor...,"[i, rented, i, am, curiousyellow, from, my, vi...","[i, rent, i, am, curiousyellow, from, my, vide..."
1,"""I Am Curious: Yellow"" is a risible and preten...",0,i am curious yellow is a risible and pretentio...,"[i, am, curious, yellow, is, a, risible, and, ...","[i, am, curiou, yellow, is, a, risibl, and, pr..."
2,If only to avoid making this type of film in t...,0,if only to avoid making this type of film in t...,"[if, only, to, avoid, making, this, type, of, ...","[if, onli, to, avoid, make, thi, type, of, fil..."
3,This film was probably inspired by Godard's Ma...,0,this film was probably inspired by godards mas...,"[this, film, was, probably, inspired, by, goda...","[thi, film, wa, probabl, inspir, by, godard, m..."
4,"Oh, brother...after hearing about this ridicul...",0,oh brotherafter hearing about this ridiculous ...,"[oh, brotherafter, hearing, about, this, ridic...","[oh, brotheraft, hear, about, thi, ridicul, fi..."


### 3.6 Lengths of: Review, cleaned_text, tokens and stemmed_tokens

In [0]:
ps_df['review_length'] = ps_df['text'].apply(len)
ps_df['cleaned_text_length'] = ps_df['cleaned_text'].apply(len)
ps_df['tokens_length'] = ps_df['tokens'].apply(len)
ps_df['stemmed_tokens_length'] = ps_df['stemmed_tokens'].apply(len)
ps_df.head()

Unnamed: 0,text,label,cleaned_text,tokens,stemmed_tokens,review_length,cleaned_text_length,tokens_length,stemmed_tokens_length
0,I rented I AM CURIOUS-YELLOW from my video sto...,0,i rented i am curiousyellow from my video stor...,"[i, rented, i, am, curiousyellow, from, my, vi...","[i, rent, i, am, curiousyellow, from, my, vide...",1640,1582,286,286
1,"""I Am Curious: Yellow"" is a risible and preten...",0,i am curious yellow is a risible and pretentio...,"[i, am, curious, yellow, is, a, risible, and, ...","[i, am, curiou, yellow, is, a, risibl, and, pr...",1294,1249,214,214
2,If only to avoid making this type of film in t...,0,if only to avoid making this type of film in t...,"[if, only, to, avoid, making, this, type, of, ...","[if, onli, to, avoid, make, thi, type, of, fil...",528,500,92,92
3,This film was probably inspired by Godard's Ma...,0,this film was probably inspired by godards mas...,"[this, film, was, probably, inspired, by, goda...","[thi, film, wa, probabl, inspir, by, godard, m...",706,659,115,115
4,"Oh, brother...after hearing about this ridicul...",0,oh brotherafter hearing about this ridiculous ...,"[oh, brotherafter, hearing, about, this, ridic...","[oh, brotheraft, hear, about, thi, ridicul, fi...",1814,1686,306,306
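
The four length columns measure different things: the first two count characters, the last two count list items. The sketch below reproduces them for one made-up review in plain Python. `toy_stem` is a hypothetical stand-in that only strips a trailing "s"; it is not the Porter stemmer used above. It is enough to show why the token and stem counts match: stemming maps each token to exactly one stem.

```python
import re

def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a stemmer: strip one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 1 else token

review = "Great actors, too many subplots!"
cleaned = re.sub(r"[^a-zA-Z\s]", "", review).lower()
tokens = re.findall(r"\w+", cleaned)
stemmed = [toy_stem(token) for token in tokens]

lengths = {
    "review_length": len(review),            # characters in the raw text
    "cleaned_text_length": len(cleaned),     # characters after cleaning
    "tokens_length": len(tokens),            # number of tokens
    "stemmed_tokens_length": len(stemmed),   # number of stems
}
```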


## 4. Create SQL Table with Processed Data Using a Spark DataFrame

### 4.1 Load Pandas on Spark Dataframe to a Spark Dataframe


In [0]:
ps_spark_df = ps_df.to_spark()
type(ps_spark_df)

Out[57]: pyspark.sql.dataframe.DataFrame

### 4.2 Create SQL Table from Spark Dataframe


In [0]:
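# Optional alternative: persist the data as a table instead of a temporary view.
#table_name = 'imdb_prepared'
#dbutils.fs.rm("dbfs:/user/hive/warehouse/"+table_name,True)
#spark.sql("DROP TABLE IF EXISTS " + table_name)
#ps_df.to_spark().write.format("parquet").saveAsTable(table_name)

# Register the Spark DataFrame as a temporary view that SQL queries can read.
ps_df.to_spark().createOrReplaceTempView("imdb_prepared")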


## 5. Validate and Query SQL Table

In [0]:
sql_result = spark.sql("SELECT * FROM imdb_prepared WHERE review_length < 500 LIMIT 5")
display(sql_result)


text,label,cleaned_text,tokens,stemmed_tokens,review_length,cleaned_text_length,tokens_length,stemmed_tokens_length
"My interest in Dorothy Stratten caused me to purchase this video. Although it had great actors/actresses, there were just too many subplots going on to retain interest. Plus it just wasn't that interesting. Dialogue was stiff and confusing and the story just flipped around too much to be believable. I was pretty disappointed in what I believe was one of Audrey Hepburn's last movies. I'll always love John Ritter best in slapstick. He was just too pathetic here.",0,my interest in dorothy stratten caused me to purchase this video although it had great actorsactresses there were just too many subplots going on to retain interest plus it just wasnt that interesting dialogue was stiff and confusing and the story just flipped around too much to be believable i was pretty disappointed in what i believe was one of audrey hepburns last movies ill always love john ritter best in slapstick he was just too pathetic here,"List(my, interest, in, dorothy, stratten, caused, me, to, purchase, this, video, although, it, had, great, actorsactresses, there, were, just, too, many, subplots, going, on, to, retain, interest, plus, it, just, wasnt, that, interesting, dialogue, was, stiff, and, confusing, and, the, story, just, flipped, around, too, much, to, be, believable, i, was, pretty, disappointed, in, what, i, believe, was, one, of, audrey, hepburns, last, movies, ill, always, love, john, ritter, best, in, slapstick, he, was, just, too, pathetic, here)","List(my, interest, in, dorothi, stratten, caus, me, to, purchas, thi, video, although, it, had, great, actorsactress, there, were, just, too, mani, subplot, go, on, to, retain, interest, plu, it, just, wasnt, that, interest, dialogu, wa, stiff, and, confus, and, the, stori, just, flip, around, too, much, to, be, believ, i, wa, pretti, disappoint, in, what, i, believ, wa, one, of, audrey, hepburn, last, movi, ill, alway, love, john, ritter, best, in, slapstick, he, wa, just, too, pathet, here)",464,452,78,78
"I think I will make a movie next weekend. Oh wait, I'm working..oh I'm sure I can fit it in. It looks like whoever made this film fit it in. I hope the makers of this crap have day jobs because this film sucked!!! It looks like someones home movie and I don't think more than $100 was spent making it!!! Total crap!!! Who let's this stuff be released?!?!?!",0,i think i will make a movie next weekend oh wait im workingoh im sure i can fit it in it looks like whoever made this film fit it in i hope the makers of this crap have day jobs because this film sucked it looks like someones home movie and i dont think more than was spent making it total crap who lets this stuff be released,"List(i, think, i, will, make, a, movie, next, weekend, oh, wait, im, workingoh, im, sure, i, can, fit, it, in, it, looks, like, whoever, made, this, film, fit, it, in, i, hope, the, makers, of, this, crap, have, day, jobs, because, this, film, sucked, it, looks, like, someones, home, movie, and, i, dont, think, more, than, was, spent, making, it, total, crap, who, lets, this, stuff, be, released)","List(i, think, i, will, make, a, movi, next, weekend, oh, wait, im, workingoh, im, sure, i, can, fit, it, in, it, look, like, whoever, made, thi, film, fit, it, in, i, hope, the, maker, of, thi, crap, have, day, job, becaus, thi, film, suck, it, look, like, someon, home, movi, and, i, dont, think, more, than, wa, spent, make, it, total, crap, who, let, thi, stuff, be, releas)",356,327,68,68
Ned aKelly is such an important story to Australians but this movie is awful. It's an Australian story yet it seems like it was set in America. Also Ned was an Australian yet he has an Irish accent...it is the worst film I have seen in a long time,0,ned akelly is such an important story to australians but this movie is awful its an australian story yet it seems like it was set in america also ned was an australian yet he has an irish accentit is the worst film i have seen in a long time,"List(ned, akelly, is, such, an, important, story, to, australians, but, this, movie, is, awful, its, an, australian, story, yet, it, seems, like, it, was, set, in, america, also, ned, was, an, australian, yet, he, has, an, irish, accentit, is, the, worst, film, i, have, seen, in, a, long, time)","List(ned, akelli, is, such, an, import, stori, to, australian, but, thi, movi, is, aw, it, an, australian, stori, yet, it, seem, like, it, wa, set, in, america, also, ned, wa, an, australian, yet, he, ha, an, irish, accentit, is, the, worst, film, i, have, seen, in, a, long, time)",247,241,49,49
Protocol is an implausible movie whose only saving grace is that it stars Goldie Hawn along with a good cast of supporting actors. The story revolves around a ditzy cocktail waitress who becomes famous after inadvertently saving the life of an Arab dignitary. The story goes downhill halfway through the movie and Goldie's charm just doesn't save this movie. Unless you are a Goldie Hawn fan don't go out of your way to see this film.,0,protocol is an implausible movie whose only saving grace is that it stars goldie hawn along with a good cast of supporting actors the story revolves around a ditzy cocktail waitress who becomes famous after inadvertently saving the life of an arab dignitary the story goes downhill halfway through the movie and goldies charm just doesnt save this movie unless you are a goldie hawn fan dont go out of your way to see this film,"List(protocol, is, an, implausible, movie, whose, only, saving, grace, is, that, it, stars, goldie, hawn, along, with, a, good, cast, of, supporting, actors, the, story, revolves, around, a, ditzy, cocktail, waitress, who, becomes, famous, after, inadvertently, saving, the, life, of, an, arab, dignitary, the, story, goes, downhill, halfway, through, the, movie, and, goldies, charm, just, doesnt, save, this, movie, unless, you, are, a, goldie, hawn, fan, dont, go, out, of, your, way, to, see, this, film)","List(protocol, is, an, implaus, movi, whose, onli, save, grace, is, that, it, star, goldi, hawn, along, with, a, good, cast, of, support, actor, the, stori, revolv, around, a, ditzi, cocktail, waitress, who, becom, famou, after, inadvert, save, the, life, of, an, arab, dignitari, the, stori, goe, downhil, halfway, through, the, movi, and, goldi, charm, just, doesnt, save, thi, movi, unless, you, are, a, goldi, hawn, fan, dont, go, out, of, your, way, to, see, thi, film)",434,427,76,76
Outlandish premise that rates low on plausibility and unfortunately also struggles feebly to raise laughs or interest. Only Hawn's well-known charm allows it to skate by on very thin ice. Goldie's gotta be a contender for an actress who's done so much in her career with very little quality material at her disposal...,0,outlandish premise that rates low on plausibility and unfortunately also struggles feebly to raise laughs or interest only hawns wellknown charm allows it to skate by on very thin ice goldies gotta be a contender for an actress whos done so much in her career with very little quality material at her disposalbr br,"List(outlandish, premise, that, rates, low, on, plausibility, and, unfortunately, also, struggles, feebly, to, raise, laughs, or, interest, only, hawns, wellknown, charm, allows, it, to, skate, by, on, very, thin, ice, goldies, gotta, be, a, contender, for, an, actress, whos, done, so, much, in, her, career, with, very, little, quality, material, at, her, disposalbr, br)","List(outlandish, premis, that, rate, low, on, plausibl, and, unfortun, also, struggl, feebli, to, rais, laugh, or, interest, onli, hawn, wellknown, charm, allow, it, to, skate, by, on, veri, thin, ice, goldi, gotta, be, a, contend, for, an, actress, who, done, so, much, in, her, career, with, veri, littl, qualiti, materi, at, her, disposalbr, br)",330,315,54,54


In [0]:
%sql
 
REFRESH TABLE imdb_prepared;
SELECT * FROM imdb_prepared WHERE review_length < 500 LIMIT 5

text,label,cleaned_text,tokens,stemmed_tokens,review_length,cleaned_text_length,tokens_length,stemmed_tokens_length
"My interest in Dorothy Stratten caused me to purchase this video. Although it had great actors/actresses, there were just too many subplots going on to retain interest. Plus it just wasn't that interesting. Dialogue was stiff and confusing and the story just flipped around too much to be believable. I was pretty disappointed in what I believe was one of Audrey Hepburn's last movies. I'll always love John Ritter best in slapstick. He was just too pathetic here.",0,my interest in dorothy stratten caused me to purchase this video although it had great actorsactresses there were just too many subplots going on to retain interest plus it just wasnt that interesting dialogue was stiff and confusing and the story just flipped around too much to be believable i was pretty disappointed in what i believe was one of audrey hepburns last movies ill always love john ritter best in slapstick he was just too pathetic here,"List(my, interest, in, dorothy, stratten, caused, me, to, purchase, this, video, although, it, had, great, actorsactresses, there, were, just, too, many, subplots, going, on, to, retain, interest, plus, it, just, wasnt, that, interesting, dialogue, was, stiff, and, confusing, and, the, story, just, flipped, around, too, much, to, be, believable, i, was, pretty, disappointed, in, what, i, believe, was, one, of, audrey, hepburns, last, movies, ill, always, love, john, ritter, best, in, slapstick, he, was, just, too, pathetic, here)","List(my, interest, in, dorothi, stratten, caus, me, to, purchas, thi, video, although, it, had, great, actorsactress, there, were, just, too, mani, subplot, go, on, to, retain, interest, plu, it, just, wasnt, that, interest, dialogu, wa, stiff, and, confus, and, the, stori, just, flip, around, too, much, to, be, believ, i, wa, pretti, disappoint, in, what, i, believ, wa, one, of, audrey, hepburn, last, movi, ill, alway, love, john, ritter, best, in, slapstick, he, wa, just, too, pathet, here)",464,452,78,78
"I think I will make a movie next weekend. Oh wait, I'm working..oh I'm sure I can fit it in. It looks like whoever made this film fit it in. I hope the makers of this crap have day jobs because this film sucked!!! It looks like someones home movie and I don't think more than $100 was spent making it!!! Total crap!!! Who let's this stuff be released?!?!?!",0,i think i will make a movie next weekend oh wait im workingoh im sure i can fit it in it looks like whoever made this film fit it in i hope the makers of this crap have day jobs because this film sucked it looks like someones home movie and i dont think more than was spent making it total crap who lets this stuff be released,"List(i, think, i, will, make, a, movie, next, weekend, oh, wait, im, workingoh, im, sure, i, can, fit, it, in, it, looks, like, whoever, made, this, film, fit, it, in, i, hope, the, makers, of, this, crap, have, day, jobs, because, this, film, sucked, it, looks, like, someones, home, movie, and, i, dont, think, more, than, was, spent, making, it, total, crap, who, lets, this, stuff, be, released)","List(i, think, i, will, make, a, movi, next, weekend, oh, wait, im, workingoh, im, sure, i, can, fit, it, in, it, look, like, whoever, made, thi, film, fit, it, in, i, hope, the, maker, of, thi, crap, have, day, job, becaus, thi, film, suck, it, look, like, someon, home, movi, and, i, dont, think, more, than, wa, spent, make, it, total, crap, who, let, thi, stuff, be, releas)",356,327,68,68
Ned aKelly is such an important story to Australians but this movie is awful. It's an Australian story yet it seems like it was set in America. Also Ned was an Australian yet he has an Irish accent...it is the worst film I have seen in a long time,0,ned akelly is such an important story to australians but this movie is awful its an australian story yet it seems like it was set in america also ned was an australian yet he has an irish accentit is the worst film i have seen in a long time,"List(ned, akelly, is, such, an, important, story, to, australians, but, this, movie, is, awful, its, an, australian, story, yet, it, seems, like, it, was, set, in, america, also, ned, was, an, australian, yet, he, has, an, irish, accentit, is, the, worst, film, i, have, seen, in, a, long, time)","List(ned, akelli, is, such, an, import, stori, to, australian, but, thi, movi, is, aw, it, an, australian, stori, yet, it, seem, like, it, wa, set, in, america, also, ned, wa, an, australian, yet, he, ha, an, irish, accentit, is, the, worst, film, i, have, seen, in, a, long, time)",247,241,49,49
Protocol is an implausible movie whose only saving grace is that it stars Goldie Hawn along with a good cast of supporting actors. The story revolves around a ditzy cocktail waitress who becomes famous after inadvertently saving the life of an Arab dignitary. The story goes downhill halfway through the movie and Goldie's charm just doesn't save this movie. Unless you are a Goldie Hawn fan don't go out of your way to see this film.,0,protocol is an implausible movie whose only saving grace is that it stars goldie hawn along with a good cast of supporting actors the story revolves around a ditzy cocktail waitress who becomes famous after inadvertently saving the life of an arab dignitary the story goes downhill halfway through the movie and goldies charm just doesnt save this movie unless you are a goldie hawn fan dont go out of your way to see this film,"List(protocol, is, an, implausible, movie, whose, only, saving, grace, is, that, it, stars, goldie, hawn, along, with, a, good, cast, of, supporting, actors, the, story, revolves, around, a, ditzy, cocktail, waitress, who, becomes, famous, after, inadvertently, saving, the, life, of, an, arab, dignitary, the, story, goes, downhill, halfway, through, the, movie, and, goldies, charm, just, doesnt, save, this, movie, unless, you, are, a, goldie, hawn, fan, dont, go, out, of, your, way, to, see, this, film)","List(protocol, is, an, implaus, movi, whose, onli, save, grace, is, that, it, star, goldi, hawn, along, with, a, good, cast, of, support, actor, the, stori, revolv, around, a, ditzi, cocktail, waitress, who, becom, famou, after, inadvert, save, the, life, of, an, arab, dignitari, the, stori, goe, downhil, halfway, through, the, movi, and, goldi, charm, just, doesnt, save, thi, movi, unless, you, are, a, goldi, hawn, fan, dont, go, out, of, your, way, to, see, thi, film)",434,427,76,76
Outlandish premise that rates low on plausibility and unfortunately also struggles feebly to raise laughs or interest. Only Hawn's well-known charm allows it to skate by on very thin ice. Goldie's gotta be a contender for an actress who's done so much in her career with very little quality material at her disposal...,0,outlandish premise that rates low on plausibility and unfortunately also struggles feebly to raise laughs or interest only hawns wellknown charm allows it to skate by on very thin ice goldies gotta be a contender for an actress whos done so much in her career with very little quality material at her disposalbr br,"List(outlandish, premise, that, rates, low, on, plausibility, and, unfortunately, also, struggles, feebly, to, raise, laughs, or, interest, only, hawns, wellknown, charm, allows, it, to, skate, by, on, very, thin, ice, goldies, gotta, be, a, contender, for, an, actress, whos, done, so, much, in, her, career, with, very, little, quality, material, at, her, disposalbr, br)","List(outlandish, premis, that, rate, low, on, plausibl, and, unfortun, also, struggl, feebli, to, rais, laugh, or, interest, onli, hawn, wellknown, charm, allow, it, to, skate, by, on, veri, thin, ice, goldi, gotta, be, a, contend, for, an, actress, who, done, so, much, in, her, career, with, veri, littl, qualiti, materi, at, her, disposalbr, br)",330,315,54,54
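
The same filter-and-limit query can be tried without a cluster. The sketch below uses an in-memory SQLite database purely as a local stand-in for Spark SQL; the three rows are hypothetical, and only three of the columns are modeled.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in for Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE imdb_prepared (text TEXT, label INTEGER, review_length INTEGER)"
)

# Hypothetical rows; review_length is the character count of the text.
reviews = [("Too many subplots.", 0), ("Loved it.", 1), ("x" * 600, 0)]
conn.executemany(
    "INSERT INTO imdb_prepared VALUES (?, ?, ?)",
    [(text, label, len(text)) for text, label in reviews],
)

# Same filter and limit as the queries above.
rows = conn.execute(
    "SELECT text, label, review_length FROM imdb_prepared "
    "WHERE review_length < 500 LIMIT 5"
).fetchall()
conn.close()
```

Without an ORDER BY clause, neither engine promises which qualifying rows come back first, so LIMIT 5 returns some five matches rather than a fixed five.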


# Conclusion

### Pandas
- **Use Cases**: Small to medium-sized datasets, local data analysis, quick prototyping.
- **Advantages**: Easy to use, rich functionality, excellent for in-memory operations.
- **Disadvantages**: Not suitable for very large datasets due to memory constraints.

### PySpark
- **Use Cases**: Large datasets, distributed data processing, big data analytics.
- **Advantages**: Scalable, can handle large datasets, integrates well with Hadoop.
- **Disadvantages**: More complex than Pandas, requires a Spark cluster.

### SQL
- **Use Cases**: Data querying, reporting, integration with BI tools.
- **Advantages**: Familiarity for users with SQL background, powerful for data retrieval and manipulation.
- **Disadvantages**: Limited to SQL operations, may require additional steps for complex data manipulations.