# Loading and Cleaning Missing and Duplicate Data

[Dataset used](https://www.kaggle.com/datasets/mdhamani/goodreads-books-100k/data?select=GoodReads_100k_books.csv)

Loading

In [1]:
import pandas as pd
import numpy as np
import csv


In [2]:
# Try to load the dataset, skipping malformed lines
df = pd.read_csv('/content/GoodReads_100k_books.csv', engine='python', on_bad_lines='skip')

# Report how many rows were loaded (lines skipped as malformed are not counted)
print(f"Number of rows loaded: {df.shape[0]}")

# Display the first rows of the DataFrame
df.head()


Number of rows loaded: 45211


| | author | bookformat | desc | genre | img | isbn | isbn13 | link | pages | rating | reviews | title | totalratings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Laurence M. Hauptman | Hardcover | Reveals that several hundred thousand Indians ... | History,Military History,Civil War,American Hi... | https://i.gr-assets.com/images/S/compressed.ph... | 002914180X | 9780000000000.0 | https://goodreads.com/book/show/1001053.Betwee... | 0 | 3.52 | 5 | Between Two Fires: American Indians in the Civ... | 33 |
| 1 | Charlotte Fiell,Emmanuelle Dirix | Paperback | Fashion Sourcebook - 1920s is the first book i... | Couture,Fashion,Historical,Art,Nonfiction | https://i.gr-assets.com/images/S/compressed.ph... | 1906863482 | 9780000000000.0 | https://goodreads.com/book/show/10010552-fashi... | 576 | 4.51 | 6 | Fashion Sourcebook 1920s | 41 |
| 2 | Andy Anderson | Paperback | The seminal history and analysis of the Hungar... | Politics,History | https://i.gr-assets.com/images/S/compressed.ph... | 948984147 | 9780000000000.0 | https://goodreads.com/book/show/1001077.Hungar... | 124 | 4.15 | 2 | Hungary 56 | 26 |
| 3 | Carlotta R. Anderson | Hardcover | "All-American Anarchist" chronicles the life a... | Labor,History | https://i.gr-assets.com/images/S/compressed.ph... | 814327079 | 9780000000000.0 | https://goodreads.com/book/show/1001079.All_Am... | 324 | 3.83 | 1 | All-American Anarchist: Joseph A. Labadie and ... | 6 |
| 4 | Jean Leveille | NaN | Aujourdâ€™hui, lâ€™oiseau nous invite Ã sa ta... | NaN | https://i.gr-assets.com/images/S/compressed.ph... | 2761920813 | NaN | https://goodreads.com/book/show/10010880-les-o... | 177 | 4.0 | 1 | Les oiseaux gourmands | 1 |
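
With `on_bad_lines='skip'`, malformed lines are dropped silently, so the row count alone does not show how many were lost. A minimal sketch of one way to count them, assuming pandas 1.4 or later (where `on_bad_lines` accepts a callable with the Python engine). The helper `load_with_skip_count` and the inline sample CSV are hypothetical and only for illustration:

```python
import io
import pandas as pd

def load_with_skip_count(source):
    """Read a CSV, dropping malformed rows and counting how many were dropped."""
    skipped = []

    def remember_bad_line(fields):
        # pandas calls this with the list of fields of each malformed row
        skipped.append(fields)
        return None  # returning None tells pandas to drop the row

    frame = pd.read_csv(source, engine='python', on_bad_lines=remember_bad_line)
    return frame, len(skipped)

# Made-up sample: the second data row has one field too many
sample = io.StringIO("title,rating\nA,4.0\nB,3.5,extra\nC,5.0\n")
books, n_skipped = load_with_skip_count(sample)
print(f"Rows loaded: {len(books)}, rows skipped: {n_skipped}")
```

The same helper could be pointed at the real file path instead of the in-memory sample.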


In [3]:
# Replace empty strings with NaN so they are recognized as missing values
df = df.replace('', np.nan)
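
The replacement above only matches strings that are exactly empty. As a sketch, assuming some fields might contain only whitespace (not confirmed for this dataset), a regular expression can catch both cases. The sample frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up sample with an empty string and a whitespace-only string
sample = pd.DataFrame({'genre': ['History', '', '   '], 'pages': [10, 20, 30]})

# ^\s*$ matches empty and whitespace-only strings; numeric columns are left alone
cleaned = sample.replace(r'^\s*$', np.nan, regex=True)
n_missing = int(cleaned['genre'].isna().sum())
print(f"Missing genres after replacement: {n_missing}")
```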

Removing duplicates

In [4]:
# Count duplicate rows
print("Duplicates before:", df.duplicated().sum())

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Check the result
print("Duplicates after:", df.duplicated().sum())


Duplicates before: 0
Duplicates after: 0
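
`duplicated()` with no arguments only flags rows that match in every column, so a count of 0 does not rule out the same book appearing twice with small differences. A hedged sketch of checking a subset of columns instead. The choice of `title` and `author` as the key is an assumption, and the sample rows are made up:

```python
import pandas as pd

# Made-up sample: the first two rows differ only in 'bookformat'
sample = pd.DataFrame({
    'title': ['Book A', 'Book A', 'Book B'],
    'author': ['Author X', 'Author X', 'Author Y'],
    'bookformat': ['Paperback', 'Hardcover', 'Paperback'],
})

full_row_dupes = int(sample.duplicated().sum())
key_dupes = int(sample.duplicated(subset=['title', 'author']).sum())
print(f"Full-row duplicates: {full_row_dupes}, duplicates by key: {key_dupes}")
```

Whether different editions of one title should count as duplicates depends on the analysis, so this is a check to consider rather than a rule.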


Handling missing values

In [5]:
# Count the missing values in each column
missing_values = df.isnull().sum()
print(missing_values)


author             0
bookformat      1466
desc            2995
genre           4229
img             1338
isbn            5629
isbn13          4329
link               0
pages              0
rating             0
reviews            0
title              0
totalratings       0
dtype: int64
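
Raw counts are easier to judge relative to the total number of rows. A small sketch of the same check expressed as percentages, using a made-up sample frame in place of `df`:

```python
import numpy as np
import pandas as pd

# Made-up sample: half of the 'bookformat' values are missing
sample = pd.DataFrame({
    'bookformat': ['Hardcover', np.nan, 'Paperback', np.nan],
    'title': ['A', 'B', 'C', 'D'],
})

# The mean of a boolean mask is the fraction of True values
missing_pct = (sample.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```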


In [6]:
# Drop rows with missing values in 'title', 'rating', 'reviews' or 'totalratings'
# (none of these columns has missing values here, so no rows are removed)
df = df.dropna(subset=['title', 'rating', 'reviews', 'totalratings'])

# Fill the remaining missing values with placeholder strings
df['bookformat'] = df['bookformat'].fillna('Desconhecido')
df['desc'] = df['desc'].fillna('Descrição não fornecida')
df['genre'] = df['genre'].fillna('Gênero desconhecido')
df['img'] = df['img'].fillna('Imagem não fornecida')
df['isbn'] = df['isbn'].fillna('ISBN não fornecido')
# Note: 'isbn13' is numeric on load, so a string placeholder turns it into an object column
df['isbn13'] = df['isbn13'].fillna('ISBN13 não fornecido')
# 'link' and 'pages' have no missing values here, so these two calls change nothing
df['link'] = df['link'].fillna('Link não fornecido')
df['pages'] = df['pages'].fillna('Páginas desconhecidas')

# Check whether any missing values remain
print(df.isnull().sum())

# Remove duplicates again, in case filling made any rows identical
df = df.drop_duplicates()

# Check the number of rows after the changes
print(f"Number of rows after handling missing values: {df.shape[0]}")


author          0
bookformat      0
desc            0
genre           0
img             0
isbn            0
isbn13          0
link            0
pages           0
rating          0
reviews         0
title           0
totalratings    0
dtype: int64
Number of rows after handling missing values: 45211
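
Filling a numeric column with a text placeholder mixes numbers and strings in one column, which makes later numeric operations awkward. One possible alternative, sketched on made-up data: keep the gap as a missing value in a nullable integer column and record it in a separate flag. The column name `isbn13_missing` and the placeholder `'Unknown'` are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing ISBN-13 and one missing format
sample = pd.DataFrame({
    'isbn13': [9780000000000.0, np.nan],
    'bookformat': ['Hardcover', np.nan],
})

# Record which rows lacked an ISBN-13 before converting the column
sample['isbn13_missing'] = sample['isbn13'].isna()

# Nullable integer dtype: whole numbers stay integers, gaps become <NA>
sample['isbn13'] = sample['isbn13'].astype('Int64')

# Text columns can still take a string placeholder without changing dtype
sample['bookformat'] = sample['bookformat'].fillna('Unknown')
print(sample.dtypes)
```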


In [7]:
# Show DataFrame information to confirm the cleaning
# (df.info() prints its report directly and returns None, so it is not wrapped in print)
df.info()
# Display the first rows of the DataFrame for a visual review
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   author        45211 non-null  object 
 1   bookformat    45211 non-null  object 
 2   desc          45211 non-null  object 
 3   genre         45211 non-null  object 
 4   img           45211 non-null  object 
 5   isbn          45211 non-null  object 
 6   isbn13        45211 non-null  object 
 7   link          45211 non-null  object 
 8   pages         45211 non-null  int64  
 9   rating        45211 non-null  float64
 10  reviews       45211 non-null  int64  
 11  title         45211 non-null  object 
 12  totalratings  45211 non-null  int64  
dtypes: float64(1), int64(3), object(9)
memory usage: 4.5+ MB


| | author | bookformat | desc | genre | img | isbn | isbn13 | link | pages | rating | reviews | title | totalratings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Laurence M. Hauptman | Hardcover | Reveals that several hundred thousand Indians ... | History,Military History,Civil War,American Hi... | https://i.gr-assets.com/images/S/compressed.ph... | 002914180X | 9.78E+12 | https://goodreads.com/book/show/1001053.Betwee... | 0 | 3.52 | 5 | Between Two Fires: American Indians in the Civ... | 33 |
| 1 | Charlotte Fiell,Emmanuelle Dirix | Paperback | Fashion Sourcebook - 1920s is the first book i... | Couture,Fashion,Historical,Art,Nonfiction | https://i.gr-assets.com/images/S/compressed.ph... | 1906863482 | 9.78E+12 | https://goodreads.com/book/show/10010552-fashi... | 576 | 4.51 | 6 | Fashion Sourcebook 1920s | 41 |
| 2 | Andy Anderson | Paperback | The seminal history and analysis of the Hungar... | Politics,History | https://i.gr-assets.com/images/S/compressed.ph... | 948984147 | 9.78E+12 | https://goodreads.com/book/show/1001077.Hungar... | 124 | 4.15 | 2 | Hungary 56 | 26 |
| 3 | Carlotta R. Anderson | Hardcover | "All-American Anarchist" chronicles the life a... | Labor,History | https://i.gr-assets.com/images/S/compressed.ph... | 814327079 | 9.78E+12 | https://goodreads.com/book/show/1001079.All_Am... | 324 | 3.83 | 1 | All-American Anarchist: Joseph A. Labadie and ... | 6 |
| 4 | Jean Leveille | Desconhecido | Aujourdâ€™hui, lâ€™oiseau nous invite Ã sa ta... | Gênero desconhecido | https://i.gr-assets.com/images/S/compressed.ph... | 2761920813 | ISBN13 não fornecido | https://goodreads.com/book/show/10010880-les-o... | 177 | 4.0 | 1 | Les oiseaux gourmands | 1 |
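
The first row of the preview has `pages` equal to 0 even though `pages` reports no missing values. A possible reading, stated here as an assumption rather than a fact about the dataset, is that 0 stands in for an unknown page count. A sketch of how such values could be counted and converted to missing values, on made-up data:

```python
import pandas as pd

# Made-up sample: one book with a page count of 0
sample = pd.DataFrame({'title': ['A', 'B', 'C'], 'pages': [0, 576, 124]})

# Count the suspicious zeros before changing anything
n_zero_pages = int((sample['pages'] == 0).sum())

# mask() replaces values where the condition is True with NaN
sample['pages'] = sample['pages'].mask(sample['pages'] == 0)
print(f"Zero page counts converted to missing: {n_zero_pages}")
```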
