## Movie Review Analysis with Pandas

This notebook uses the `IMDB Dataset of 50K Movie Reviews` to analyze and manipulate movie reviews with Pandas string, binning, and grouping functions.

The goal is to apply text transformations, run exploratory analyses, and answer a series of basic questions about the data.


## Importing Libraries and Loading the Data

In [6]:
import pandas as pd

# Load the data
file_path = 'data/IMDB Dataset.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


### Initial Data Exploration

In [None]:
df['review'][0]  # look at the first entry of the review column

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [13]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   review         50000 non-null  object
 1   sentiment      50000 non-null  object
 2   review_length  50000 non-null  int64 
 3   review_lower   50000 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.5+ MB


Unnamed: 0,review_length
count,50000.0
mean,1309.43102
std,989.728014
min,32.0
25%,699.0
50%,970.0
75%,1590.25
max,13704.0


## What is the distribution of the sentiment classes?

In [3]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

## Data Cleaning and Transformation

#### Create a new column with the length of each review

In [7]:
df['review_length'] = df['review'].str.len()
df[['review', 'review_length']].head()

Unnamed: 0,review,review_length
0,One of the other reviewers has mentioned that ...,1761
1,A wonderful little production. <br /><br />The...,998
2,I thought this was a wonderful way to spend ti...,926
3,Basically there's a family where a little boy ...,748
4,"Petter Mattei's ""Love in the Time of Money"" is...",1317
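Note that `str.len()` counts every character, including markup such as the `<br />` tags visible in the reviews. The following is a minimal sketch, on a tiny made-up DataFrame, of how the length changes once those tags are removed. The names `toy` and `clean_length` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Tiny made-up sample; the notebook itself works on the IMDB reviews.
toy = pd.DataFrame({'review': ['Great.<br /><br />Loved it.', 'Bad.']})

# Raw length: the HTML line-break tags count as characters.
toy['review_length'] = toy['review'].str.len()

# Remove the literal '<br />' tag (regex=False treats it as plain text), then measure.
toy['clean_length'] = toy['review'].str.replace('<br />', '', regex=False).str.len()

print(toy[['review_length', 'clean_length']])
```

Whether the tags should count toward the length depends on the question being asked; the rest of the notebook keeps the raw length.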


#### Convert all reviews to lowercase

In [8]:
df['review_lower'] = df['review'].str.lower()
df[['review', 'review_lower']].head()

Unnamed: 0,review,review_lower
0,One of the other reviewers has mentioned that ...,one of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,a wonderful little production. <br /><br />the...
2,I thought this was a wonderful way to spend ti...,i thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...","petter mattei's ""love in the time of money"" is..."


#### Create a new column indicating whether the review contains the word 'amazing'

In [14]:
df['contains_amazing'] = df['review'].str.contains('amazing', case=False)
df[['review', 'contains_amazing']].head()

Unnamed: 0,review,contains_amazing
0,One of the other reviewers has mentioned that ...,False
1,A wonderful little production. <br /><br />The...,False
2,I thought this was a wonderful way to spend ti...,False
3,Basically there's a family where a little boy ...,False
4,"Petter Mattei's ""Love in the Time of Money"" is...",False


#### Filter reviews that mention 'bad'

In [None]:
df_bad = df[df['review'].str.contains('bad', case=False)]
print(len(df_bad) / len(df))  # about 25% of the reviews contain the substring 'bad'
df_bad.head()


0.25408


Unnamed: 0,review,sentiment,review_length,review_lower,contains_amazing
7,"This show was an amazing, fresh & innovative i...",negative,934,"this show was an amazing, fresh & innovative i...",True
8,Encouraged by the positive comments about this...,negative,681,encouraged by the positive comments about this...,False
12,So im not a big fan of Boll's work but then ag...,negative,2227,so im not a big fan of boll's work but then ag...,False
14,This a fantastic movie of three prisoners who ...,positive,275,this a fantastic movie of three prisoners who ...,False
15,"Kind of drawn in by the erotic scenes, only to...",negative,830,"kind of drawn in by the erotic scenes, only to...",False
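`str.contains('bad')` matches the substring anywhere, so words such as "badly" or "badge" also count. As a hedged sketch on made-up reviews (the `toy` DataFrame is hypothetical), a regular expression with word boundaries restricts the match to the standalone word:

```python
import pandas as pd

toy = pd.DataFrame({'review': [
    'A bad film.',
    'Badly paced but fun.',
    'He wore a badge.',
    'Great!',
]})

# Substring match: also hits 'Badly' and 'badge'.
substring_hits = toy['review'].str.contains('bad', case=False)

# Whole-word match: \b marks a word boundary in a regular expression.
word_hits = toy['review'].str.contains(r'\bbad\b', case=False, regex=True)

print(substring_hits.sum(), word_hits.sum())
```

On the real dataset the whole-word share would presumably be lower than the 25% reported for the substring, but that figure is not computed here.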


#### Create a column categorizing the reviews by length

- Curta (short): up to 500 characters
- Média (medium): more than 500 and up to 1000 characters
- Longa (long): more than 1000 characters

In [None]:
df['review_category'] = pd.cut(df['review_length'], bins=[0, 500, 1000, float('inf')], labels=['Curta', 'Média', 'Longa'])
# bins lists the interval edges. With the default right=True each interval includes its right edge: (0, 500], (500, 1000] and (1000, inf]

df[['review_length', 'review_category']].head()

Unnamed: 0,review_length,review_category
0,1761,Longa
1,998,Média
2,926,Média
3,748,Média
4,1317,Longa


In [24]:
df['review_category'].value_counts()

review_category
Longa    24001
Média    21016
Curta     4983
Name: count, dtype: int64
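To make the edge behavior of `pd.cut` concrete, here is a small sketch with lengths chosen to sit exactly on and just past each bin edge. The `lengths` Series is made up for illustration; the bins and labels are the ones used in the notebook.

```python
import pandas as pd

# Values on and just past each edge of the bins.
lengths = pd.Series([1, 500, 501, 1000, 1001])

categories = pd.cut(
    lengths,
    bins=[0, 500, 1000, float('inf')],
    labels=['Curta', 'Média', 'Longa'],
)  # right=True is the default, so each bin includes its right edge

print(categories.tolist())
# ['Curta', 'Curta', 'Média', 'Média', 'Longa']
```

This agrees with the table in the notebook, where a review of 998 characters is labeled Média and one of 1317 characters is labeled Longa.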

## Questions

### Which reviews have more than 1000 characters?

In [26]:
df[df['review_length'] > 1000]

Unnamed: 0,review,sentiment,review_length,review_lower,contains_amazing,review_category
0,One of the other reviewers has mentioned that ...,positive,1761,one of the other reviewers has mentioned that ...,False,Longa
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1317,"petter mattei's ""love in the time of money"" is...",False,Longa
12,So im not a big fan of Boll's work but then ag...,negative,2227,so im not a big fan of boll's work but then ag...,False,Longa
17,This movie made it into one of my top 10 most ...,negative,1322,this movie made it into one of my top 10 most ...,False,Longa
20,After the success of Die Hard and it's sequels...,positive,1813,after the success of die hard and it's sequels...,False,Longa
...,...,...,...,...,...,...
49991,"Les Visiteurs, the first movie about the medie...",negative,1498,"les visiteurs, the first movie about the medie...",False,Longa
49993,Robert Colomb has two full-time jobs. He's kno...,negative,2717,robert colomb has two full-time jobs. he's kno...,False,Longa
49995,I thought this movie did a down right good job...,positive,1008,i thought this movie did a down right good job...,False,Longa
49997,I am a Catholic taught in parochial elementary...,negative,1280,i am a catholic taught in parochial elementary...,False,Longa


In [30]:
# How many reviews contain the word 'excellent'? (case-sensitive match)

len(df[df['review'].str.contains('excellent')]) 

3409

In [32]:
# What is the average length of positive and negative reviews?

# group by sentiment, then average review_length within each group

df.groupby('sentiment')['review_length'].mean()


sentiment
negative    1294.06436
positive    1324.79768
Name: review_length, dtype: float64
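The `describe()` output earlier in the notebook shows a mean length of about 1309 against a median of 970, so a few very long reviews pull the mean up. A possible extension, sketched on made-up data (the `toy` DataFrame and `summary` name are hypothetical), is to request several statistics per group with `agg`:

```python
import pandas as pd

toy = pd.DataFrame({
    'sentiment': ['positive', 'negative', 'positive', 'negative'],
    'review_length': [100, 200, 300, 600],
})

# One row per sentiment, one column per statistic.
summary = toy.groupby('sentiment')['review_length'].agg(['count', 'mean', 'median'])

print(summary)
```

Comparing the median next to the mean gives a quick check on whether a difference between groups is driven by outliers.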

In [36]:
# Keep only the negative reviews that mention the word 'boring'

aux = df[(df['sentiment'] == 'negative') & (df['review'].str.contains('boring'))]
aux

Unnamed: 0,review,sentiment,review_length,review_lower,contains_amazing,review_category,contains_excellent
8,Encouraged by the positive comments about this...,negative,681,encouraged by the positive comments about this...,False,Média,False
23,"First of all, let's get a few things straight ...",negative,1767,"first of all, let's get a few things straight ...",False,Longa,False
34,"I watched this film not really expecting much,...",negative,1300,"i watched this film not really expecting much,...",False,Longa,False
63,"Besides being boring, the scenes were oppressi...",negative,267,"besides being boring, the scenes were oppressi...",False,Curta,False
107,While Star Trek the Motion Picture was mostly ...,negative,1642,while star trek the motion picture was mostly ...,False,Longa,False
...,...,...,...,...,...,...,...
49913,Why does this movie fall WELL below standards?...,negative,1438,why does this movie fall well below standards?...,False,Longa,False
49939,Depending entirely on your own personal state ...,negative,1927,depending entirely on your own personal state ...,False,Longa,False
49946,One of the greatest lessons I ever had in how ...,negative,2901,one of the greatest lessons i ever had in how ...,False,Longa,False
49948,"It is the early morning of our discontent, and...",negative,5847,"it is the early morning of our discontent, and...",False,Longa,False
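The filter above combines two boolean masks with `&`. A minimal sketch of the same pattern on made-up rows, with the masks given names (`is_negative` and `mentions_boring` are hypothetical) so that each condition can be inspected separately:

```python
import pandas as pd

toy = pd.DataFrame({
    'review': ['So boring.', 'Not boring at all!', 'Terrible.'],
    'sentiment': ['negative', 'positive', 'negative'],
})

is_negative = toy['sentiment'] == 'negative'
mentions_boring = toy['review'].str.contains('boring')

# & is element-wise AND. Naming the masks avoids the parentheses that
# inline comparisons need, because & binds more tightly than ==.
negative_and_boring = toy[is_negative & mentions_boring]

print(negative_and_boring)
```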


In [None]:
aux = aux.reset_index(drop=True)  # reset the index so the filtered rows are numbered from 0
aux['review'][0]  # look at the first entry of the review column

"Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. <br /><br />The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only."

In [None]:
# What proportion of the reviews is short, medium, and long?

df['review_category'].value_counts(normalize=True)  # normalize=True returns proportions instead of counts

review_category
Longa    0.48002
Média    0.42032
Curta    0.09966
Name: proportion, dtype: float64
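As a quick illustration of what `normalize=True` does, the sketch below (on a made-up Series named `labels`) shows that the proportions are the counts divided by the number of rows, which is how 24001 of 50000 becomes 0.48002 above:

```python
import pandas as pd

labels = pd.Series(['Longa', 'Longa', 'Média', 'Curta'])

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# Dividing the counts by the number of rows gives the same proportions.
manual = counts / len(labels)

print(shares)
```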

In [46]:
# What are the 5 longest reviews in the dataset?

# alternative: df.nlargest(5, 'review_length')['review']

# sort by review_length in descending order and take the first 5 rows

sub_df = df.sort_values('review_length', ascending=False).head()
sub_df


Unnamed: 0,review,sentiment,review_length,review_lower,contains_amazing,review_category,contains_excellent
31481,Match 1: Tag Team Table Match Bubba Ray and Sp...,positive,13704,match 1: tag team table match bubba ray and sp...,False,Longa,False
40521,There's a sign on The Lost Highway that says:<...,positive,12988,there's a sign on the lost highway that says:<...,False,Longa,True
31240,"(Some spoilers included:)<br /><br />Although,...",positive,12930,"(some spoilers included:)<br /><br />although,...",False,Longa,False
31436,"Back in the mid/late 80s, an OAV anime by titl...",positive,12129,"back in the mid/late 80s, an oav anime by titl...",False,Longa,False
5708,**Attention Spoilers**<br /><br />First of all...,positive,10363,**attention spoilers**<br /><br />first of all...,True,Longa,False
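The commented-out `nlargest` call is an alternative to sorting. A small sketch on made-up lengths (the `toy` DataFrame is hypothetical) suggests that both approaches select the same rows when the values are distinct:

```python
import pandas as pd

toy = pd.DataFrame({'review_length': [10, 50, 30, 40, 20, 60]})

# Sort everything in descending order, then keep the top 3 rows.
by_sort = toy.sort_values('review_length', ascending=False).head(3)

# Ask directly for the 3 largest values.
by_nlargest = toy.nlargest(3, 'review_length')

print(by_nlargest)
```

With tied values the two methods may order the tied rows differently, so the comparison here deliberately uses distinct lengths.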


In [None]:
# How many reviews mention both 'amazing' and 'boring'?

len(df[(df['review_lower'].str.contains('amazing')) & (df['review_lower'].str.contains('boring'))])
# use the lowercase column; otherwise reviews containing 'Amazing' or 'Boring' would be missed



141

In [54]:
# Same question using the contains_amazing column. Here 'boring' is matched
# case-sensitively on the original text, so the count is lower than with review_lower
len(df[df['contains_amazing'] & df['review'].str.contains('boring')])

138
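The two counts differ (141 versus 138) because `contains_amazing` was built with `case=False`, while the second filter looks for lowercase 'boring' in the original text. A minimal sketch on made-up reviews (the `toy` DataFrame is hypothetical) reproduces the effect:

```python
import pandas as pd

toy = pd.DataFrame({'review': [
    'Amazing start, boring end.',
    'Boring but the score is amazing.',
    'Just boring.',
]})

amazing = toy['review'].str.contains('amazing', case=False)

# Case-sensitive search misses 'Boring' in the second review.
sensitive = (amazing & toy['review'].str.contains('boring')).sum()

# Case-insensitive search on both words catches it.
insensitive = (amazing & toy['review'].str.contains('boring', case=False)).sum()

print(sensitive, insensitive)
```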

In [62]:
# Is there a relationship between review length and the presence of the word 'amazing'?

# groupby needs three pieces:
#   1. the column to group by
#   2. the column to aggregate
#   3. the summary statistic to compute

# df.groupby(grouping_column)[value_column].statistic()

df.groupby('contains_amazing')['review_length'].mean()

contains_amazing
False    1292.055091
True     1642.517547
Name: review_length, dtype: float64

In [None]:
# Does the average review length differ between reviews that contain the word 'movie' and those that do not?

df['contains_movie'] = df['review'].str.contains('movie', case=False)  # case=False ignores upper/lower case

df.groupby('contains_movie')['review_length'].mean()



contains_movie
False    1256.507726
True     1338.026000
Name: review_length, dtype: float64

In [66]:
D = df.to_dict()  # convert the DataFrame to a dictionary of the form {column: {index: value}}
D['review'][0]  # look at the first entry of the review column

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

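The shape of the dictionary returned by `to_dict()` depends on its `orient` argument. The sketch below, on a made-up two-row DataFrame, contrasts the default layout used above with `orient='records'`; the names `by_column` and `by_row` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'review': ['Good', 'Bad'], 'review_length': [4, 3]})

# Default orient='dict': {column: {index: value}}
by_column = toy.to_dict()

# orient='records': a list with one dictionary per row
by_row = toy.to_dict(orient='records')

print(by_column['review'][0])
print(by_row[0])
```

The column-oriented form suits lookups like `D['review'][0]`, while the record form is often more convenient when iterating row by row.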
In [None]:
df.to_csv('reviews_imbd_modificadas.csv', index=False)  # save the DataFrame to a CSV file without the index

In [70]:
pd.read_csv('reviews_imbd_modificadas.csv')  # read the CSV file back

Unnamed: 0,review,sentiment,review_length,review_lower,contains_amazing,review_category,contains_excellent,contains_movie
0,One of the other reviewers has mentioned that ...,positive,1761,one of the other reviewers has mentioned that ...,False,Longa,False,False
1,A wonderful little production. <br /><br />The...,positive,998,a wonderful little production. <br /><br />the...,False,Média,False,False
2,I thought this was a wonderful way to spend ti...,positive,926,i thought this was a wonderful way to spend ti...,False,Média,False,False
3,Basically there's a family where a little boy ...,negative,748,basically there's a family where a little boy ...,False,Média,False,True
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1317,"petter mattei's ""love in the time of money"" is...",False,Longa,False,True
...,...,...,...,...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,1008,i thought this movie did a down right good job...,False,Longa,False,True
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,642,"bad plot, bad dialogue, bad acting, idiotic di...",False,Média,False,False
49997,I am a Catholic taught in parochial elementary...,negative,1280,i am a catholic taught in parochial elementary...,False,Longa,False,True
49998,I'm going to have to disagree with the previou...,negative,1234,i'm going to have to disagree with the previou...,False,Longa,False,False
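One thing to watch in a round trip like this: CSV stores only text, so the categorical dtype that `pd.cut` gave `review_category` is not preserved on reload. The sketch below uses an in-memory buffer and a made-up three-row DataFrame rather than the notebook's file, and shows one possible way to rebuild the ordered categories; `toy`, `restored`, and `loaded_dtype` are hypothetical names.

```python
import io
import pandas as pd

toy = pd.DataFrame({'review_length': [100, 700, 1500]})
toy['review_category'] = pd.cut(
    toy['review_length'],
    bins=[0, 500, 1000, float('inf')],
    labels=['Curta', 'Média', 'Longa'],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
loaded_dtype = restored['review_category'].dtype  # no longer categorical
print(toy['review_category'].dtype, loaded_dtype)

# Rebuild the ordered categorical if it is needed after loading.
restored['review_category'] = pd.Categorical(
    restored['review_category'],
    categories=['Curta', 'Média', 'Longa'],
    ordered=True,
)
```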


In [73]:
import json


stats = {
    'total_reviews': len(df),
    'avg_review_length': df['review_length'].mean(),
    'percent_contains_amazing': df['contains_amazing'].mean(),
}

with open('reviews_imdb_completo.json', 'w') as f:
    json.dump(stats, f)  # save the dictionary to a JSON file

In [74]:
# load the JSON file back

with open('reviews_imdb_completo.json', 'r') as f:
    stats = json.load(f)

stats

{'total_reviews': 50000,
 'avg_review_length': 1309.43102,
 'percent_contains_amazing': 0.04958}
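Two details are worth noting about this summary. The value stored under `percent_contains_amazing` is a fraction (0.04958, about 4.96%), and the `json` module only serializes built-in Python types, so NumPy integer scalars such as the result of `.max()` on an integer column need converting first. A hedged sketch on made-up data follows; the keys `max_review_length` and `share_contains_amazing` are hypothetical and not part of the notebook.

```python
import json
import pandas as pd

toy = pd.DataFrame({'review_length': [4, 3], 'contains_amazing': [True, False]})

stats = {
    'total_reviews': len(toy),                                        # plain int
    'max_review_length': int(toy['review_length'].max()),             # NumPy int64 converted to int
    'share_contains_amazing': float(toy['contains_amazing'].mean()),  # fraction between 0 and 1
}

# dumps returns the JSON text; json.dump(stats, f) would write the same content to a file.
text = json.dumps(stats, indent=2)
print(text)
```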