# Data Formatting (numeric and strings)

Formatting is the process of making sure each column has the right data type.

It is important to realize that formatting may surface messiness in the values, which in turn requires cleaning the column again.


# Numeric case

Let's start by reading in some data:

In [None]:
from IPython.display import IFrame  
wikiLink1="https://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts#Deaths_by_country"
IFrame(wikiLink1, width=900, height=500)

That table shows the top 20 countries with deaths per year from 2016 to 2021. Let's bring it in:

In [None]:
import pandas as pd
badCountries = pd.read_html(wikiLink1,flavor='bs4',
                        attrs = {'class': 'wikitable sortable'})
len(badCountries)

The table of interest is here:

In [None]:
theTable=badCountries[4]
theTable

The first thing to notice is the column names, which form a MultiIndex:

In [None]:
theTable.columns

What about these new names?

In [None]:
# concatenate elements

newNames=["_".join((b,a)) for a,b in theTable.columns]
newNames

In [None]:
# looks better?
betterNames=[n.split('[')[0] for n in newNames]
betterNames

In [None]:
#Then:
theTable.columns=betterNames
theTable
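
The renaming steps above can be tried on a small frame that needs no download. This is only a sketch: the two-level header and the values below are made up to imitate a scraped table, and are not the real Wikipedia data.

```python
import pandas as pd

# Hypothetical two-level header imitating a scraped table
cols = pd.MultiIndex.from_tuples([
    ("Country", "Country"),
    ("2016[a]", "Deaths"),
    ("2017[b]", "Deaths"),
])
toy = pd.DataFrame([["A", "1,000", "2,000*"]], columns=cols)

# Join the two levels, lower level first, as in the cells above
flat = ["_".join((low, top)) for top, low in toy.columns]

# Keep only the text before the first '[' (drops footnote markers)
flat = [name.split("[")[0] for name in flat]

toy.columns = flat
print(flat)
```

After the assignment, `toy.columns` is a plain Index of strings, so a column can be reached as `toy.Deaths_2016`.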

We no longer have a MultiIndex. Let's get rid of the first column too:

In [None]:
theTable.drop(columns=[theTable.columns[0]], inplace=True)
theTable.info()

Notice that the last column, *Deaths_2021*, has been recognised as a number, while the others have not.

Let's turn our attention to 2016:

In [None]:
theTable.Deaths_2016[theTable.Deaths_2016.str.contains(r'\D')]

Let's clean:

In [None]:
byeChars=r'\[\d+\]|\*|,'
theTable.Deaths_2016=theTable.Deaths_2016.str.replace(pat=byeChars,
                                                      repl="",
                                                      regex=True)
theTable
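
To see what the pattern removes, here is a minimal sketch on a hypothetical Series (the values are invented, chosen to show the same kinds of debris: footnote markers, asterisks, and thousands separators).

```python
import pandas as pd

# Hypothetical raw values, not taken from the real table
raw = pd.Series(["23,000[12]", "1,500*", "870", "4,200[3]*"])

# Same pattern as above: footnote markers like [12], asterisks, commas
bye_chars = r'\[\d+\]|\*|,'
clean = raw.str.replace(pat=bye_chars, repl="", regex=True)

# Anything still containing a non-digit would need another look
leftovers = clean[clean.str.contains(r'\D')]

print(clean.tolist())
print(leftovers.empty)
```

Checking for leftovers after each replacement is a cheap way to confirm the pattern covered every case before converting the column.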

We have a clean version, but:

In [None]:
theTable.info()

We need a simple step:

In [None]:
theTable.Deaths_2016=pd.to_numeric(theTable.Deaths_2016)
theTable.info()

Statistics can be obtained when data is in the right type:

In [None]:
theTable.describe()

As you see, the other columns were 'rejected'. Of course, you can force the function:

In [None]:
theTable.describe(include='all')
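
The difference between the two calls can be checked on a toy frame. The sketch below uses made-up data with one numeric column and one text column.

```python
import pandas as pd

# Hypothetical frame: one numeric column, one still stored as text
toy = pd.DataFrame({
    "Deaths_2016": [100, 250, 75],
    "Deaths_2017": ["90*", "300", "80"],
})

numeric_only = toy.describe()            # text column is left out
everything = toy.describe(include="all") # text column is summarised too

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
```

For the text column, `include='all'` reports counts and the most frequent value, while the numeric statistics such as the mean stay empty.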

The function **to_numeric** worked because we had a clean column. If that is not the case, it raises an error (uncomment to see it):

In [None]:
#pd.to_numeric(theTable.Deaths_2017)
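
If you prefer not to stop the notebook with a traceback, the failure can be caught instead. A minimal sketch on hypothetical, uncleaned values:

```python
import pandas as pd

# Hypothetical values that still carry a comma and an asterisk
dirty = pd.Series(["120", "3,400", "57*"])

try:
    pd.to_numeric(dirty)
    failed = False
except ValueError as err:
    # to_numeric reports the first string it could not parse
    failed = True
    print("Conversion failed:", err)
```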

Python can coerce values in simple situations:

In [None]:
float('20.1')

In [None]:
# but not this one

#float('20.1*')

Let's use a loop to detect issues in *Deaths_2017*:

In [None]:
for value in theTable.Deaths_2017:
    try:
        float(value)
    except (ValueError, TypeError):  # the value cannot be parsed as a number
        print(value)

In [None]:
# or 
theTable.Deaths_2017[theTable.Deaths_2017.str.contains(r'\D')]
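
A third way to locate the offending values is to let **to_numeric** mark them. The sketch below assumes a hypothetical Series: `errors='coerce'` turns every unparseable entry into NaN, and that mask is used to pull out the original strings.

```python
import pandas as pd

# Hypothetical values, two of them not directly convertible
values = pd.Series(["120", "3,400", "57*", "98"])

# Anything that cannot be parsed becomes NaN instead of raising
as_numbers = pd.to_numeric(values, errors="coerce")

# Original strings that failed to parse (and were not missing to begin with)
problems = values[as_numbers.isna() & values.notna()]
print(problems.tolist())
```

Unlike the `\D` check, this approach does not flag decimal points or minus signs, since strings such as `'20.1'` parse correctly.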

Notice that, by default, **to_numeric** returns a 64-bit dtype (**int64** or **float64**, depending on the data). Consider this difference in memory:

In [None]:
theTable.loc[:,['Deaths_2016','Deaths_2021']].astype('Int32').info()

In [None]:
theTable.loc[:,['Deaths_2016','Deaths_2021']].astype('int32').info()

In [None]:
theTable.loc[:,['Deaths_2016','Deaths_2021']].info()
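
The same comparison can be sketched on a small hypothetical Series with a missing value. The capitalised `Int32` is the pandas nullable integer, which can hold missing values; the lowercase `int32` is the NumPy integer, which cannot, so the missing value is filled with 0 here purely for illustration.

```python
import pandas as pd

# Hypothetical counts, with one missing value (stored as float64)
counts = pd.Series([1200.0, 35.0, None, 870.0])

as_float64 = counts
as_Int32 = counts.astype("Int32")                 # nullable, keeps <NA>
filled_int32 = counts.fillna(0).astype("int32")   # NumPy int, no missing allowed

for name, s in [("float64", as_float64),
                ("Int32", as_Int32),
                ("int32", filled_int32)]:
    print(name, s.memory_usage(index=False), "bytes")
```

The nullable type costs a little more than the plain NumPy integer because it stores a mask alongside the values, but it avoids inventing a replacement for the missing entries.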

### Exercise

Clean the remaining columns that are supposed to have numeric values.