## Cleaning the Dataset - WIKIPEDIA

- Using web scraping, we extracted some important information for a future Machine Learning project. Now, through data cleaning, we will clean the data that still carries residue from the web page and therefore cannot yet be fed to ML algorithms.

- I will therefore clean the data into a clearer, easier-to-read form, to add value to the dataset. Here we will go through the columns of the dataset and remove the strings and values that add nothing to it.

In [1]:
# Reading the .csv file scraped from Wikipedia.

import re
import numpy as np
import pandas as pd



In [2]:
dataset = pd.read_csv("Dataset.csv")

In [3]:
dataset.head(100)

Unnamed: 0,Country/Territory,US$,Total Area,Percentage Water,Total Nominal GDP,Per Capita GDP
0,Luxembourg,113196,"2,586.4 km2 (998.6 sq mi) (167th)",0.60%,$69.453 billion[3] (69th),"$113,196[3] (1st)"
1,Switzerland,83716,"41,285 km2 (15,940 sq mi) (132nd)",4.2,$704 billion[7] (20th),"$82,950[7] (2nd)"
2,Macau,81151,115.3 km2 (44.5 sq mi),73.7,$54.545 billion[3] (83rd),"$81,728[3] (3rd)"
3,Norway,77975,"385,207 km2 (148,729 sq mi)[7] (67thb)",5.7c,$443 billion[9] (22nd),"$82,711[9] (3rd)"
4,Ireland,77771,"70,273 km2 (27,133 sq mi) (118th)",2.00,$384.940 billion[5] (32nd),"$77,771[5] (4th)"
...,...,...,...,...,...,...
95,Jamaica,5460,"10,991 km2 (4,244 sq mi) (160th)",1.5,$15.424 billion[8] (119th),"$5,393[8] (95th)"
96,Albania,5372,"28,748 km2 (11,100 sq mi) (140th)",4.7,$16.753 billion[3],"$5,847[3]"
97,Guyana,5252,"214,970 km2 (83,000 sq mi) (83rd)",8.4,$8.065 billion[6],"$10,249[6]"
98,Libya,5019,"1,759,541 km2 (679,363 sq mi) (16th)","6,871,287[3] (108th)",$51.330 billion[4] (98),"$7,803[4]"


- Renaming the column headers

In [4]:
dataset.rename(columns={"Country/Territory": "Country"}, inplace = True)
dataset.rename(columns={"Total Area": "Total Area (km2)"}, inplace = True)

In [5]:
dataset.head(5)

Unnamed: 0,Country,US$,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Luxembourg,113196,"2,586.4 km2 (998.6 sq mi) (167th)",0.60%,$69.453 billion[3] (69th),"$113,196[3] (1st)"
1,Switzerland,83716,"41,285 km2 (15,940 sq mi) (132nd)",4.2,$704 billion[7] (20th),"$82,950[7] (2nd)"
2,Macau,81151,115.3 km2 (44.5 sq mi),73.7,$54.545 billion[3] (83rd),"$81,728[3] (3rd)"
3,Norway,77975,"385,207 km2 (148,729 sq mi)[7] (67thb)",5.7c,$443 billion[9] (22nd),"$82,711[9] (3rd)"
4,Ireland,77771,"70,273 km2 (27,133 sq mi) (118th)",2.00,$384.940 billion[5] (32nd),"$77,771[5] (4th)"


- We can see that almost every column has cells with data in parentheses and square brackets, which we do not need. So we can start by removing all parentheses and brackets together with their contents.

In [6]:
dataset.columns

Index(['Country', 'US$', 'Total Area (km2)', 'Percentage Water',
       'Total Nominal GDP', 'Per Capita GDP'],
      dtype='object')

In [6]:
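# Only text columns have the .str accessor, so numeric columns are skipped
for column in dataset.select_dtypes(include="object").columns:
    # regex=True is spelled out because pandas >= 2.0 treats the pattern as a literal by default
    dataset[column] = dataset[column].str.replace(r"\(.*\)", "", regex=True)  # "( ... )" groups
    dataset[column] = dataset[column].str.replace(r"\[.*\]", "", regex=True)  # "[ ... ]" groups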

In [7]:
dataset.head(5)

Unnamed: 0,Country,US$,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Luxembourg,113196,"2,586.4 km2",0.60%,$69.453 billion,"$113,196"
1,Switzerland,83716,"41,285 km2",4.2,$704 billion,"$82,950"
2,Macau,81151,115.3 km2,73.7,$54.545 billion,"$81,728"
3,Norway,77975,"385,207 km2",5.7c,$443 billion,"$82,711"
4,Ireland,77771,"70,273 km2",2.00,$384.940 billion,"$77,771"
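The same idea can be tried in isolation on a few cells copied from the table. This is only a sketch: the patterns below are non-greedy (`.*?`), which is an assumption of this example rather than what the notebook cell uses, and the names `sample` and `cleaned` are made up for illustration. Non-greedy matching removes each group separately, so text sitting between two groups would survive.

```python
import pandas as pd

# Three raw cells copied from the scraped table.
sample = pd.Series([
    "2,586.4 km2 (998.6 sq mi) (167th)",
    "$69.453 billion[3] (69th)",
    "$113,196[3] (1st)",
])

cleaned = (
    sample
    .str.replace(r"\(.*?\)", "", regex=True)   # drop each "( ... )" group
    .str.replace(r"\[.*?\]", "", regex=True)   # drop each "[ ... ]" group
    .str.strip()                               # remove leftover spaces at the ends
)
print(cleaned.tolist())
```

The trailing `.str.strip()` also removes the space that is left behind where a group used to be.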


In [8]:
# Now strip the percent sign and the stray footnote letter 'c' from the 'Percentage Water' column

dataset['Percentage Water'] = dataset['Percentage Water'].str.strip('%')
dataset['Percentage Water'] = dataset['Percentage Water'].str.strip()
dataset['Percentage Water'] = dataset['Percentage Water'].str.strip('c')

In [9]:
dataset['Percentage Water']

0            0.60
1             4.2
2            73.7
3             5.7
4            2.00
          ...    
175          0.02
176          20.6
177          0.14
178            10
179    12,778,250
Name: Percentage Water, Length: 180, dtype: object

In [10]:
dataset['Total Area (km2)'] 

0        2,586.4 km2 
1         41,285 km2 
2          115.3 km2 
3        385,207 km2 
4         70,273 km2 
            ...      
175    1,267,000 km2 
176      118,484 km2 
177      117,600 km2 
178       27,834 km2 
179      619,745 km2 
Name: Total Area (km2), Length: 180, dtype: object

In [39]:
dataset


Unnamed: 0,Country,US$,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Luxembourg,113196,"2,586.4 km2",0.60,$69.453 billion,"$113,196"
1,Switzerland,83716,"41,285 km2",4.2,$704 billion,"$82,950"
2,Macau,81151,115.3 km2,73.7,$54.545 billion,"$81,728"
3,Norway,77975,"385,207 km2",5.7,$443 billion,"$82,711"
4,Ireland,77771,"70,273 km2",2.00,$384.940 billion,"$77,771"
...,...,...,...,...,...,...
175,Niger,405,"1,267,000 km2",0.02,$9.869 billion,$510
176,Malawi,370,"118,484 km2",20.6,$7.436 billion,$367
177,Eritrea,342,"117,600 km2",0.14,$8.116 billion,"$1,295"
178,Burundi,309,"27,834 km2",10,$3.573 billion,$310


In [43]:
# Use the 'Country' column as the index (this displays a re-indexed copy; dataset itself is unchanged)

dataset.set_index("Country")

Country,US$,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
Luxembourg,113196,"2,586.4 km2",0.60,$69.453 billion,"$113,196"
Switzerland,83716,"41,285 km2",4.2,$704 billion,"$82,950"
Macau,81151,115.3 km2,73.7,$54.545 billion,"$81,728"
Norway,77975,"385,207 km2",5.7,$443 billion,"$82,711"
Ireland,77771,"70,273 km2",2.00,$384.940 billion,"$77,771"
...,...,...,...,...,...
Niger,405,"1,267,000 km2",0.02,$9.869 billion,$510
Malawi,370,"118,484 km2",20.6,$7.436 billion,$367
Eritrea,342,"117,600 km2",0.14,$8.116 billion,"$1,295"
Burundi,309,"27,834 km2",10,$3.573 billion,$310


In [11]:
area = dataset.iloc[0]['Total Area (km2)']  # inspect one raw cell (first row)
area

Cleaning the area data.

In [12]:
# Remove the thousands separators; the decimal point is kept
dataset['Total Area (km2)'] = dataset['Total Area (km2)'].str.replace(',', '', regex=False)

In [13]:
area_col = dataset.columns.get_loc('Total Area (km2)')

for x in range(len(dataset)):  # walk through every row
    area = dataset.iloc[x, area_col]          # raw text, e.g. '2586.4 km2 '
    area = area.split('-')[0]                 # keep the first value if the cell holds a range
    area = area.split('km2')[0]               # drop the unit so its '2' is not read as a digit
    area = re.sub(r'[^0-9.]+', '', area)      # keep digits and the decimal point only
    # Write through .iloc[row, column]; chained indexing may modify a temporary copy
    dataset.iloc[x, area_col] = int(float(area))  # whole square kilometres

dataset.head(5)

Unnamed: 0,Country,US$,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
0,Luxembourg,113196,2586,0.6,$69.453 billion,"$113,196"
1,Switzerland,83716,41285,4.2,$704 billion,"$82,950"
2,Macau,81151,115,73.7,$54.545 billion,"$81,728"
3,Norway,77975,385207,5.7,$443 billion,"$82,711"
4,Ireland,77771,70273,2.0,$384.940 billion,"$77,771"
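One detail is easy to miss when cleaning this column: the unit `km2` itself ends in a digit, so deleting every non-digit character would glue a stray `2` onto each value. As a hedged sketch of a vectorised alternative to the row-by-row loop, the example below extracts only the first number in each cell and keeps it as a float. The names `areas` and `numeric_area` are made up for illustration.

```python
import pandas as pd

# Cells in the shape they have once parentheses and brackets are removed.
areas = pd.Series(["2,586.4 km2 ", "41,285 km2 ", "115.3 km2 "])

numeric_area = (
    areas
    .str.replace(",", "", regex=False)               # thousands separators
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)   # first number only; the "2" of "km2" is never reached
    .astype(float)
)
print(numeric_area.tolist())
```

Keeping floats preserves values such as 2586.4; converting to `int` afterwards would truncate them.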


- Now we start cleaning the data in the 'Percentage Water' column

In [14]:
dataset['Percentage Water'].head(50)

0            0.60
1             4.2
2            73.7
3             5.7
4            2.00
5             0.8
6             2.7
7            6.97
8       5,703,600
9            0.76
10          19.26
11            8.7
12            1.7
13           59.8
14             10
15              0
16     83,166,711
17           8.92
18            6.5
19            2.1
20    551,695 km2
21           1.34
22    125,930,000
23            1.6
24             28
25            2.4
26            1.6
27            0.3
28          0.001
29           1.04
30     negligible
31            8.6
32            0.7
33     negligible
34     negligible
35     23,780,452
36           4.45
37              2
38            0.5
39            0.7
40         0.8669
41         0.0789
42           1.35
43     Negligible
44          1.57%
45     negligible
46     Negligible
47     negligible
48            3.7
49     negligible
Name: Percentage Water, dtype: object

- As you can see, some rows contain 'negligible'. So that we do not have to delete those rows, we replace these entries with 0.0.

- In rows where the value is greater than 100, the real value was missing and other content was read in its place. We therefore remove these rows for lack of information.
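The two rules above can be sketched on a handful of values taken from the output: map 'negligible' to 0.0, convert to numbers, then keep only plausible percentages. This is an illustration under assumptions, not the notebook's exact cell: lower-casing before the replacement is a choice of this sketch, and the names `raw`, `as_text`, `water` and `valid` are made up.

```python
import pandas as pd

# Representative raw cells from the 'Percentage Water' column.
raw = pd.Series(["0.60", "negligible", "Negligible", "1.57%", "5,703,600", "551,695 km2"])

# Rule 1: 'negligible' in any capitalisation becomes 0.0.
as_text = raw.str.strip().str.lower().replace("negligible", "0.0")

# Keep digits and the decimal point only, then convert; unparseable cells become NaN.
as_text = as_text.str.replace(r"[^0-9.]+", "", regex=True)
water = pd.to_numeric(as_text, errors="coerce")

# Rule 2: a percentage above 100 means another field was scraped into this cell.
valid = water[water <= 100]
print(valid.tolist())
```

The comparison `water <= 100` is also false for NaN, so cells that could not be parsed are dropped together with the oversized ones.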

In [15]:
# Map 'negligible' (either capitalisation) to 0.0
dataset['Percentage Water'] = dataset['Percentage Water'].replace(['negligible', 'Negligible'], '0.0')
# Drop every character that is not a digit or a decimal point (this also removes thousands commas)
dataset['Percentage Water'] = dataset['Percentage Water'].str.replace(r'[^0-9.]+', '', regex=True)

# Convert the column to float; cells that cannot be parsed become NaN
dataset['Percentage Water'] = pd.to_numeric(dataset['Percentage Water'], errors='coerce')

# Keep only the rows whose value is at most 100; .copy() avoids SettingWithCopyWarning in later cells
dataset = dataset[dataset['Percentage Water'] <= 100].copy()


In [44]:
print (dataset.dtypes)

Country               object
US$                   object
Total Area (km2)      object
Percentage Water     float64
Total Nominal GDP     object
Per Capita GDP        object
dtype: object
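Filtering rows with a boolean mask and then assigning to a column of the result is the usual trigger for pandas' SettingWithCopyWarning. A minimal sketch of one way to avoid it, on a toy frame (the three rows and the names `df` and `subset` are made up for illustration): take an explicit `.copy()` after filtering and write through `.loc`.

```python
import pandas as pd

# Toy data standing in for the scraped table.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Percentage Water": [4.2, 5703600.0, 0.6],
    "Per Capita GDP": ["$82,950", "$1,000", "$113,196"],
})

# .copy() makes the filtered result an independent DataFrame.
subset = df[df["Percentage Water"] <= 100].copy()

# .loc with an explicit column label writes to `subset` itself.
subset.loc[:, "Per Capita GDP"] = subset["Per Capita GDP"].str.replace("$", "", regex=False)

print(subset["Per Capita GDP"].tolist())
```

The original `df` is left untouched, which makes it clear which object each later cell is modifying.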


In [16]:
dataset['Percentage Water'].head(50)

0      0.6000
1      4.2000
2     73.7000
3      5.7000
4      2.0000
5      0.8000
6      2.7000
7      6.9700
9      0.7600
10    19.2600
11     8.7000
12     1.7000
13    59.8000
14    10.0000
15     0.0000
17     8.9200
18     6.5000
19     2.1000
21     1.3400
23     1.6000
24    28.0000
25     2.4000
26     1.6000
27     0.3000
28     0.0010
29     1.0400
30     0.0000
31     8.6000
32     0.7000
33     0.0000
34     0.0000
36     4.4500
37     2.0000
38     0.5000
39     0.7000
40     0.8669
41     0.0789
42     1.3500
43     0.0000
44     1.5700
45     0.0000
46     0.0000
47     0.0000
48     3.7000
49     0.0000
50     1.5000
51     0.0000
52     0.0000
53     2.9000
54     1.0700
Name: Percentage Water, dtype: float64

In [46]:
dataset.tail(50)

Unnamed: 0,Country,US$,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
125,Vietnam,2740,331212,6.38,$261.637 billion,"$2,740"
126,Laos,2670,237955,2.0,US$20.153 billion,"US$2,670"
128,Venezuela,2547,916445,3.2,$70.140 billion,"$2,548"
129,"Congo, Republic of the",2534,342000,3.3,$11.162 billion,"$2,444"
130,East Timor,2262,15007,0.0,$3.145 billion,"$2,422"
131,Solomon Islands,2246,28400,3.2,$1.511 billion,"$2,357"
132,Ghana,2223,239567,4.61,$69.757 billion,"$2,266"
133,Nigeria,2222,923769,1.4,$504.57 billion,"$2,465"
134,India,2171,3287263,9.6,$3.202 trillion,"$2,338"
135,Kenya,1997,580367,2.3,$109.116 billion,"$2,151"


- The total GDP column expresses values in trillions, billions and millions. We can remove the $ sign and convert the words to numbers.
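As a small sketch of that conversion, the helper below strips everything but the number and multiplies it by the factor that matches the magnitude word. `parse_gdp` and `MULTIPLIERS` are hypothetical names introduced for this example; `round` is used instead of `int` so that floating-point error cannot shave one unit off the result.

```python
import re

# Hypothetical lookup table: magnitude word -> multiplier.
MULTIPLIERS = {"trillion": 10**12, "billion": 10**9, "million": 10**6}

def parse_gdp(text: str) -> int:
    """Convert strings such as '$3.202 trillion' to a whole number of dollars."""
    number = float(re.sub(r"[^0-9.]+", "", text))   # numeric part only
    for word, factor in MULTIPLIERS.items():
        if word in text.lower():
            return round(number * factor)
    return round(number)                            # no magnitude word present

print(parse_gdp("$69.453 billion"))
print(parse_gdp("US$20.153 billion"))
print(parse_gdp("$3.202 trillion"))
```

Applied to a whole column, the helper could be passed to `Series.map`, assuming every cell is a string.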

In [17]:
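# regex=False treats '$' as a literal character rather than the end-of-string anchor
dataset['Total Nominal GDP'] = dataset['Total Nominal GDP'].str.replace('$', '', regex=False)
dataset['Total Nominal GDP'] = dataset['Total Nominal GDP'].str.replace('US', '', regex=False)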

In [48]:
# Strip the currency markers and the thousands separators
dataset['Per Capita GDP'] = dataset['Per Capita GDP'].str.replace('$', '', regex=False)
dataset['Per Capita GDP'] = dataset['Per Capita GDP'].str.replace('US', '', regex=False)
dataset['Per Capita GDP'] = dataset['Per Capita GDP'].str.replace(',', '', regex=False)


In [18]:
dataset.tail(20)

Unnamed: 0,Country,US$,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
157,Tajikistan,877,143100,1.8,7.350 billion,$807
158,Chad,861,1284000,1.9,11 billion,$890
159,Zimbabwe,859,390757,1.0,22.290 billion,"$1,424"
160,Rwanda,824,26338,5.3,10.211 billion,$830
161,Guinea-Bissau,786,36125,22.4,1.480 billion,$851
162,Haiti,784,27750,0.7,7.897 billion,$719
163,Uganda,770,241038,15.39,30.765 billion,$833
164,"Gambia, The",755,10689,11.5,1.038 billion,$488
165,Burkina Faso,717,274200,0.146,16.226 billion,$792
167,Liberia,703,111369,13.514,3.221 billion,$704


In [27]:
# Multiplier for each magnitude word found in the column
multipliers = {'trillion': 10**12, 'billion': 10**9, 'million': 10**6}
gdp_col = dataset.columns.get_loc('Total Nominal GDP')

for x in range(len(dataset)):
    text = dataset.iloc[x, gdp_col]                   # e.g. '69.453 billion'
    number = float(re.sub(r'[^0-9.]+', '', text))     # numeric part only
    factor = next((m for word, m in multipliers.items() if word in text), 1)
    # Write through .iloc[row, column]; chained indexing would modify a temporary copy
    dataset.iloc[x, gdp_col] = round(number * factor)


In [None]:
# Convert the column to a numeric dtype; cells that cannot be parsed become NaN
dataset['Total Nominal GDP'] = pd.to_numeric(dataset['Total Nominal GDP'], errors='coerce')
dataset.tail()

In [51]:
dataset.tail(50)

Unnamed: 0,Country,US$,Total Area (km2),Percentage Water,Total Nominal GDP,Per Capita GDP
125,Vietnam,2740,331212,6.38,261637000000,2740
126,Laos,2670,237955,2.0,20153000000,2670
128,Venezuela,2547,916445,3.2,70140000000,2548
129,"Congo, Republic of the",2534,342000,3.3,11162000000,2444
130,East Timor,2262,15007,0.0,3145000000,2422
131,Solomon Islands,2246,28400,3.2,1511000000,2357
132,Ghana,2223,239567,4.61,69757000000,2266
133,Nigeria,2222,923769,1.4,504570000000,2465
134,India,2171,3287263,9.6,3202000000000,2338
135,Kenya,1997,580367,2.3,109116000000,2151


In [356]:
dataset.to_csv("Final_dataset.csv", index = False)
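The `index = False` argument matters here: without it, the row index is written as an extra unnamed column, which pandas labels `Unnamed: 0` when the file is read back. A minimal in-memory sketch of the difference, using a made-up two-row frame and `io.StringIO` in place of a real file:

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned dataset.
cleaned = pd.DataFrame({
    "Country": ["Luxembourg", "Switzerland"],
    "Percentage Water": [0.6, 4.2],
})

# Written without the index: the columns survive the round trip unchanged.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())

# Written with the default index=True: an extra column appears on reload.
with_index = io.StringIO()
cleaned.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)
print(reloaded_with_index.columns.tolist())
```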