## TARGET

Using messy data: 
* **Import** the data using Pandas.
* **Examine** the data for potential issues. 
* Use at least 8 of the **cleaning and manipulation** methods you have learned on the data.  
* Produce a Jupyter Notebook that **shows the steps you took and the code you used** to clean and transform your data set.
* **Export** a clean CSV version of your data using Pandas.
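
As a rough sketch of the final export step, `DataFrame.to_csv` can write the cleaned table back out. The frame `clean` and the in-memory buffer below are hypothetical stand-ins for the real cleaned data and output file, used only to keep the example self-contained.

```python
import io
import pandas as pd

# Hypothetical cleaned frame standing in for the final data set
clean = pd.DataFrame({
    "CountryCode": ["AFG", "ALB"],
    "Region": ["South Asia", "Europe & Central Asia"],
})

# Write to an in-memory buffer; passing a file name instead works the same way
buffer = io.StringIO()
clean.to_csv(buffer, index=False)  # index=False avoids writing an extra index column

# Reading the text back returns the same columns
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```

Writing without the index is a choice rather than a requirement; it keeps a numeric index column from appearing when the file is read again.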


### 1 Import

In [79]:
import pandas as pd

Data set imported from https://www.kaggle.com/worldbank/world-development-indicators#database.sqlite. 


`*` Awesomedata sources are said to be more realistic, but I spent too long trying to find one that would work for this exercise.

In [80]:
data = pd.read_csv('Country.csv')

### 2 First Overall Examination

In [81]:
data.head(4)

Unnamed: 0,CountryCode,ShortName,TableName,LongName,Alpha2Code,CurrencyUnit,SpecialNotes,Region,IncomeGroup,Wb2Code,...,GovernmentAccountingConcept,ImfDataDisseminationStandard,LatestPopulationCensus,LatestHouseholdSurvey,SourceOfMostRecentIncomeAndExpenditureData,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData
0,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,Fiscal year end: March 20; reporting period fo...,South Asia,Low income,AF,...,Consolidated central government,General Data Dissemination System (GDDS),1979,"Multiple Indicator Cluster Survey (MICS), 2010/11","Integrated household survey (IHS), 2008",,2013/14,,2013.0,2000.0
1,ALB,Albania,Albania,Republic of Albania,AL,Albanian lek,,Europe & Central Asia,Upper middle income,AL,...,Budgetary central government,General Data Dissemination System (GDDS),2011,"Demographic and Health Survey (DHS), 2008/09",Living Standards Measurement Study Survey (LSM...,Yes,2012,2011.0,2013.0,2006.0
2,DZA,Algeria,Algeria,People's Democratic Republic of Algeria,DZ,Algerian dinar,,Middle East & North Africa,Upper middle income,DZ,...,Budgetary central government,General Data Dissemination System (GDDS),2008,"Multiple Indicator Cluster Survey (MICS), 2012","Integrated household survey (IHS), 1995",,,2010.0,2013.0,2001.0
3,ASM,American Samoa,American Samoa,American Samoa,AS,U.S. dollar,,East Asia & Pacific,Upper middle income,AS,...,,,2010,,,Yes,2007,,,


A separate notebook, Another.ipynb, has been created that displays the entire (though still truncated) table for convenience.

Data doesn't look too bad at first glance, but there are some potential problems: 

The "LatestAgriculturalCensus" field contains records that show two years combined with a slash (for example 2013/14) instead of a single year, which will get in the way of a smooth analysis. Some fields also include a description (usually of the source) alongside the year. The same problems appear in "NationalAccountsBaseYear", "NationalAccountsReferenceYear", "PppSurveyYear", "LatestPopulationCensus", "LatestHouseholdSurvey" (although this one at least displays both source and year consistently), "SourceOfMostRecentIncomeAndExpenditureData", "VitalRegistrationComplete", "LatestIndustrialData" and "LatestTradeData".
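
One possible way to normalise such fields, shown only as a sketch: pull the first four-digit year out of each value with `Series.str.extract`. The `sample` series is hypothetical, built from the kinds of values visible in the table above, and taking the first year of a range like 2013/14 is an assumption rather than a decision made in this notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the mixed formats described above
sample = pd.Series([
    "2013/14",
    "Multiple Indicator Cluster Survey (MICS), 2010/11",
    "Integrated household survey (IHS), 2008",
    "1979",
    None,
])

# Take the first four-digit run starting with 19 or 20 as the year
first_year = sample.str.extract(r"((?:19|20)\d{2})", expand=False)

# Convert to a nullable integer type so missing values stay missing
first_year = pd.to_numeric(first_year).astype("Int64")
print(first_year.tolist())
```

The source description that precedes the year is discarded here; if it matters, it could be kept in a separate column instead.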

Some "Region" fields contain notes instead of the region. 

"AlternativeConversionFactor" contains special characters that can't even be displayed at the moment.

Also, some fields have missing values and others have NaN.

### 3 Cleaning

#### **Missing Values**
Columns with null values (**`isnull` method**), sorted by number of nulls (**`sort_values`**):

In [82]:
null_cols = data.isnull().sum()  # isnull returns a DataFrame, sum returns a Series

In [83]:
null_cols[null_cols > 0].sort_values()

Wb2Code                                         1
Alpha2Code                                      3
CurrencyUnit                                   33
Region                                         33
IncomeGroup                                    33
SystemOfNationalAccounts                       33
LatestPopulationCensus                         34
NationalAccountsBaseYear                       42
SystemOfTrade                                  47
SnaPriceValuation                              49
PppSurveyYear                                  56
LatestTradeData                                61
ImfDataDisseminationStandard                   64
BalanceOfPaymentsManualInUse                   66
LatestWaterWithdrawalData                      67
SpecialNotes                                   83
GovernmentAccountingConcept                    86
SourceOfMostRecentIncomeAndExpenditureData     89
LatestHouseholdSurvey                         100
LendingCategory                               103


With a total of 247 records in the data set, there are clear candidates for removal. We'll make the cut at 105, which is not only below half of the possible rows (so we keep the columns whose non-null values are a majority), but also falls in a gap before the next field that is significantly bigger than the gaps between the other neighbouring fields.

Using the **`drop` method**: 

In [84]:
drop_cols = list(null_cols[null_cols > 105].index)
data = data.drop(drop_cols, axis=1)
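
The same cut can also be expressed with `dropna` and its `thresh` parameter, which counts non-null values instead of nulls. This is a sketch on a hypothetical toy frame; `max_nulls` plays the role of the 105 cut-off used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "c" is mostly empty
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

max_nulls = 2  # hypothetical cut-off, standing in for 105

# dropna(axis=1, thresh=k) keeps the columns with at least k non-null values,
# so columns with more than max_nulls nulls are dropped
trimmed = toy.dropna(axis=1, thresh=len(toy) - max_nulls)
print(list(trimmed.columns))
```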

CurrencyUnit, Region, IncomeGroup and SystemOfNationalAccounts have the same number of nulls. Could that point to a common cause or a relation of some sort? Let's **subset** the data to find out:

In [87]:
null_Region = data[data['Region'].isnull()]
null_Region = null_Region[['CurrencyUnit', 'IncomeGroup', 'SystemOfNationalAccounts', 'LatestPopulationCensus', 'SystemOfTrade','ImfDataDisseminationStandard','TableName', 'Region']]
null_Region

Unnamed: 0,CurrencyUnit,IncomeGroup,SystemOfNationalAccounts,LatestPopulationCensus,SystemOfTrade,ImfDataDisseminationStandard,TableName,Region
7,,,,,,,Arab World,
35,,,,,,,Caribbean small states,
38,,,,,,,Central Europe and the Baltics,
59,,,,,,,East Asia & Pacific (all income levels),
60,,,,,,,East Asia & Pacific,
68,,,,,,,Euro area,
69,,,,,,,Europe & Central Asia (all income levels),
70,,,,,,,Europe & Central Asia,
71,,,,,,,European Union,
75,,,,,,,Fragile and conflict affected situations,


Looking at the list, we see they are all aggregates of countries that have been placed in the same column as individual countries. A further look at the data shows that all aggregates by region or income have a NaN Region. As they are not countries, it is expected that they lack a particular census, system of "national" accounts and many other fields. The region or income group is also already in TableName, so leaving NaN in those columns might have been a way to identify them. Since we can aggregate the individual countries by Region and IncomeGroup ourselves, this is duplicate data that we remove:
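
A quick way to check that the nulls really fall on the same rows, sketched on a hypothetical miniature of the table (the `toy` frame and its four rows are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Country table
toy = pd.DataFrame({
    "TableName": ["Albania", "Arab World", "Algeria", "Euro area"],
    "Region": ["Europe & Central Asia", np.nan, "Middle East & North Africa", np.nan],
    "IncomeGroup": ["Upper middle income", np.nan, "Upper middle income", np.nan],
    "CurrencyUnit": ["Albanian lek", np.nan, "Algerian dinar", np.nan],
})

cols = ["Region", "IncomeGroup", "CurrencyUnit"]
null_mask = toy[cols].isnull()

# True when every row is null either in all of the columns or in none of them
same_rows = bool((null_mask.all(axis=1) == null_mask.any(axis=1)).all())
print(same_rows)

# Names of the rows that are null in all of the columns
aggregates = toy.loc[null_mask.all(axis=1), "TableName"].tolist()
print(aggregates)
```

If `same_rows` came out `False` on the real data, some rows would be missing only part of these fields and would deserve a separate look before being dropped.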


In [86]:
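# Drop the aggregate rows from the main table, using the index labels of the null-Region subset
data = data.drop(null_Region.index, axis=0)
# Count the rows that still have a null Region
data['Region'].isnull().sum()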
