# Data type conversions - Intro

In practice, pandas often fails to determine the correct (or expected) data types automatically. In this chapter, you will learn how to perform type conversions while keeping track of conversion problems.

# Preparations

In [1]:
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", 500)

# Loading a dataset with inconsistent data types

In [2]:
# load the following tab-separated text file
df = pd.read_csv("../../data/raw/financial_data_wrong_types.tsv", sep="\t")
df.head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",UNKNOWN,No,,17.9
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,-999,3713.506,No,19317.672,5.274
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,-999,4226.559,No,,
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 824 entries, 0 to 823
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   u_company_name_id      824 non-null    int64  
 1   u_year                 824 non-null    int64  
 2   u_company_name         824 non-null    object 
 3   cb_naics               824 non-null    int64  
 4   u_iso3                 824 non-null    object 
 5   u_fye                  824 non-null    object 
 6   cb_cusip               824 non-null    object 
 7   cb_at                  824 non-null    object 
 8   cb_ni                  824 non-null    object 
 9   cb_financial_industry  824 non-null    object 
 10  cb_revt                648 non-null    float64
 11  employees              680 non-null    float64
dtypes: float64(2), int64(3), object(7)
memory usage: 77.4+ KB


The data types of *cb_at* and *cb_ni* are `object` but are expected to be `float`. Furthermore, the values of *employees* are too low (check the original text file: the values should range into the thousands and should actually be whole numbers, i.e. of type `int`). Also note that *cb_revt* is correctly identified as `float`.

In [4]:
df.employees.describe()

count      680.000000
mean      2568.891131
std       5082.956043
min          1.000000
25%          7.979250
50%         14.455500
75%       1352.750000
max      19994.000000
Name: employees, dtype: float64

# Converting to float or integer

By default, `read_csv()` only recognizes numbers that use "." as the decimal separator. Other formats (for example "," as decimal separator, or any thousands separator) are not parsed as numbers unless you declare them via the `decimal` and `thousands` parameters.

In addition, should there be strings as placeholders for missing values (here: "UNKNOWN"), the column will be typed as `object`, which can hold strings. Exceptions are empty cells and strings like "NA", "N/A" and similar, which are treated as missing by default (see [the help for pd.read_csv(), under "na_values"](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)). That is why *cb_revt* was correctly identified as `float`.

The columns *cb_revt*, *cb_at*, *cb_ni* and *employees* illustrate different problems you might face in practical datasets:
* **cb_revt**: Empty cells signify missing values, otherwise expected numbering format. Actually a pretty ideal scenario apart from the missingness.

* **cb_ni**: An unusual placeholder for missing values ("UNKNOWN") but otherwise ok.

* **cb_at**: Comma as decimal separator, point as thousands separator. A number (-999) as placeholder for missing values.

* **employees**: Thousands separator "." interpreted as decimal separator. The "N/A" strings are correctly identified as missing by `read_csv()`
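
If a placeholder is known before loading, it can already be declared when reading the file. The following is a minimal sketch of the `na_values` parameter of `read_csv()` on a small in-memory table; the sample rows and the names `raw` and `demo` are made up for illustration and are not part of the workshop data.

```python
import io

import pandas as pd

# hypothetical mini-dataset: tab-separated, "UNKNOWN" marks missing values
raw = "company\tcb_ni\nAlpha\t3713.506\nBeta\tUNKNOWN\nGamma\t-40.324\n"

# declare the placeholder for one column only, so other columns are left untouched
demo = pd.read_csv(io.StringIO(raw), sep="\t", na_values={"cb_ni": ["UNKNOWN"]})

print(demo["cb_ni"].dtype)         # float64
print(demo["cb_ni"].isna().sum())  # 1
```

This only helps when the placeholder is known in advance, which is why the rest of the chapter works through the conversion after loading.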


## *cb_revt*: a float with empty (i.e. missing) cells

Missing values in pandas, if correctly recognized, are signified by a special value (usually either `numpy.nan` for `float`, `pd.NaT` for `datetime64[ns]`, or `pd.NA` for special types in pandas like `pandas.Int64Dtype`, see [here](https://pandas.pydata.org/docs/user_guide/gotchas.html?highlight=nan#nan-integer-na-values-and-na-type-promotions) and [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) for more details). Such missing values are ignored when summarizing the data, for example when calculating a `sum()` or a `mean()`. The method `describe()` also produces statistics that exclude the missing values. The number of missing values is indirectly visible when comparing the number of observations for the column with the total number of rows in the DataFrame.

As we can see below, all results based on *cb_revt* are as we would expect them. So there is nothing to do!
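
The three kinds of missing-value markers can be illustrated with a few made-up Series. This is only a sketch; the names `s_float`, `s_date` and `s_int` and their values are invented for the example.

```python
import numpy as np
import pandas as pd

s_float = pd.Series([1.5, np.nan, 2.5])                    # float: np.nan
s_date = pd.to_datetime(pd.Series(["2005-12-31", None]))   # datetime: pd.NaT
s_int = pd.Series([1, None, 3], dtype="Int64")             # nullable integer: pd.NA

# each marker is recognized by isna()
print(s_float.isna().sum(), s_date.isna().sum(), s_int.isna().sum())  # 1 1 1

# summaries skip the missing entries
print(s_float.sum())  # 4.0
print(s_int.sum())    # 4
```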

In [5]:
# take a look at the first row which signifies NaN for cb_revt:
cols_of_interest = ["u_company_name", "u_year", "cb_revt"]
df.loc[0:1, cols_of_interest]

Unnamed: 0,u_company_name,u_year,cb_revt
0,British American Tobacco PLC,2005,
1,British American Tobacco PLC,2006,19317.672


In [6]:
# internally, this is a numpy.nan value
print(df.loc[0, "cb_revt"])
np.isnan(df.loc[0, "cb_revt"])

nan


np.True_

In [7]:
# the empty cells are correctly identified as missing
# accordingly, the summary statistics are as expected
df["cb_revt"].describe()

count       648.000000
mean      39611.380448
std       63851.660251
min           0.000000
25%         500.158000
50%       11074.421000
75%       46209.965500
max      379136.000000
Name: cb_revt, dtype: float64

In [8]:
# The function pd.isna() can test if a certain cell has missing values.
# When applied to a column (i.e. a Pandas Series), a Series of booleans is returned
pd.isna(df["cb_revt"])

0       True
1      False
2       True
3      False
4      False
       ...  
819    False
820    False
821    False
822    False
823     True
Name: cb_revt, Length: 824, dtype: bool

In [9]:
# Alternatively, each Series (and therefore each column) has an equivalent `isna()` method:
df["cb_revt"].isna()

0       True
1      False
2       True
3      False
4      False
       ...  
819    False
820    False
821    False
822    False
823     True
Name: cb_revt, Length: 824, dtype: bool

In [10]:
# How many observations are missing for cb_revt?
print(f"Missing observations for cb_revt: {sum(df['cb_revt'].isna())}")
print(f"in %: {sum(df['cb_revt'].isna()) / len(df) * 100: .2f}%")

Missing observations for cb_revt: 176
in %:  21.36%
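
The same counts can also be obtained with the Series methods `sum()` and `mean()` on the boolean result of `isna()`. A small sketch, using a made-up Series `revenue` as a stand-in for `df["cb_revt"]`:

```python
import numpy as np
import pandas as pd

revenue = pd.Series([np.nan, 19317.672, np.nan, 17721.152])  # made-up stand-in

n_missing = revenue.isna().sum()       # True counts as 1
share_missing = revenue.isna().mean()  # share of True values

print(f"Missing observations: {n_missing}")  # 2
print(f"in %: {share_missing:.2%}")          # 50.00%
```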


## Side-Track: Methods of `pandas.Series` and how they work with missing values

`pandas.Series` has many methods (see [https://pandas.pydata.org/docs/reference/api/pandas.Series.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)) for calculating summary statistics. These typically ignore missing values (e.g. `count()`, `mean()`, `std()`). Throughout the workshop, we will use more and more of these and other methods.

In [11]:
print(f"pandas.Series.count(): {df['cb_revt'].count()}")
print(f"pandas.Series.mean():  {df['cb_revt'].mean()}")
print(f"pandas.Series.std():   {df['cb_revt'].std()}")

pandas.Series.count(): 648
pandas.Series.mean():  39611.38044753086
pandas.Series.std():   63851.66025137965


This is different for the `numpy` functions of the same name, which return a missing value if any value of a `numpy.ndarray` is missing (there are special versions of these functions, e.g. `numpy.nanmean()`, that work like the `pandas.Series` methods).

The `pandas.Series.values` property accesses the native `numpy.ndarray` underlying the `pandas.Series`. That array then behaves like any `numpy.ndarray`.

In [12]:
import numpy as np

# the .values property returns the numpy.ndarray underlying the Series
print(f"type(df['cb_revt']):              {type(df['cb_revt'])}")
print(f"type(df['cb_revt'].values):       {type(df['cb_revt'].values)}")
print(f"np.mean(df['cb_revt'].values):    {np.mean(df['cb_revt'].values)}")
print(f"np.nanmean(df['cb_revt'].values): {np.nanmean(df['cb_revt'].values)}")
# np.mean() on a Series dispatches to Series.mean(), which skips NaN
print(f"np.mean(df['cb_revt']):           {np.mean(df['cb_revt'])}")

type(df['cb_revt']):              <class 'pandas.core.series.Series'>
type(df['cb_revt'].values):       <class 'numpy.ndarray'>
np.mean(df['cb_revt'].values):    nan
np.nanmean(df['cb_revt'].values): 39611.38044753086
np.mean(df['cb_revt']):           39611.38044753086


## *cb_ni*: An unusual placeholder for missing values ("UNKNOWN") but otherwise ok.

It seems that the main problem with *cb_ni* is that the missing values are not recognized as such. As a consequence, the data type is coerced to `object` (here, more specifically, a `str`).

Our task is to change the data type of the column *cb_ni* to `float` having missing values where the original value was "UNKNOWN".

Generally, when proceeding with a type conversion, it is advisable to proceed step by step, so that mistakes in the conversion process are more visible:
1. Explicitly replace known placeholders for missingness with the appropriate special value (usually `np.nan` for floats).
2. Note how many missing values are present before any type conversion.
3. Prepare the column for type conversion (e.g. when the number format is "German", first remove the thousands separator '.' and then replace the decimal separator ',' with '.').
4. Perform the data type conversion.
5. The conversion can lead to additional NA entries if you have overlooked a problem: check for this by comparing the number of missing values with the number recorded in step 2.
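
The steps above can be bundled into a small helper. The function `to_float_checked` below is a hypothetical sketch (not a pandas function) that skips step 3 on purpose and reports how many entries became missing only through the conversion.

```python
import pandas as pd


def to_float_checked(col, placeholders=()):
    """Convert a string column to float and count unexpected new NaNs (hypothetical helper)."""
    # step 1: turn known placeholders into missing values
    cleaned = col.mask(col.isin(list(placeholders)))
    # step 2: record the number of missing values before converting
    missing_before = cleaned.isna().sum()
    # step 4: convert; strings that cannot be parsed silently become NaN
    converted = pd.to_numeric(cleaned, errors="coerce")
    # step 5: compare the number of missing values before and after
    additional = int(converted.isna().sum() - missing_before)
    return converted, additional


sample = pd.Series(["3713.506", "UNKNOWN", "4226,559"])  # made-up values
converted, additional = to_float_checked(sample, placeholders=["UNKNOWN"])
print(additional)  # 1, because "4226,559" was not prepared in step 3
```

A non-zero count signals that step 3 is incomplete and the raw values need another look.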

In [13]:
cols_of_interest = ["u_company_name", "u_year", "cb_ni"]
df.loc[:, cols_of_interest]

Unnamed: 0,u_company_name,u_year,cb_ni
0,British American Tobacco PLC,2005,UNKNOWN
1,British American Tobacco PLC,2006,3713.506
2,British American Tobacco PLC,2007,4226.559
3,British American Tobacco PLC,2008,3591.888
4,British American Tobacco PLC,2009,4386.107
...,...,...,...
819,Midatech Pharma PLC,2017,UNKNOWN
820,Midatech Pharma PLC,2018,UNKNOWN
821,Motif Bio plc,2016,-40.324
822,Motif Bio plc,2017,UNKNOWN


In [14]:
# 1. Explicitly replace known placeholders for missingness with the appropriate special value (usually np.nan for floats)
df["cb_ni_corrected"] = df["cb_ni"].replace("UNKNOWN", np.nan)

# 2. Note how many missing values are present before any type conversion
missing_before = df["cb_ni_corrected"].isna().sum()
print(f"Missing before replacing 'UNKNOWN' with np.nan: {df['cb_ni'].isna().sum()}")
print(f"Missing after replacing 'UNKNOWN' with np.nan:  {missing_before}")
# 3. Prepare the column for type conversion (e.g. replace ',' with '.' and '.' with '' when number format is "German")
# => not required here!

# 4. Perform the data type conversion
df["cb_ni_corrected"] = df["cb_ni_corrected"].astype("float")

# 5. This can lead to additional NA entries if you have overlooked a problem:
#   check for this, comparing the number of missings with the number recorded in step 2.
missing_after = df["cb_ni_corrected"].isna().sum()
print(f"Missing after conversion to float:              {missing_after}")

# => all ok!

Missing before replacing 'UNKNOWN' with np.nan: 0
Missing after replacing 'UNKNOWN' with np.nan:  177
Missing after conversion to float:              177


Why are the checks in steps 2 and 5 necessary? Let us make it more complicated by creating an even more messed up column *cb_ni_messed_up*:

In [15]:
# start out with the same data
df["cb_ni_messed_up"] = df["cb_ni"]

# replace a few values with some differently formatted numbers
df.loc[df["u_year"] == 2006, "cb_ni_messed_up"] = df.loc[
    df["u_year"] == 2006, "cb_ni_messed_up"
].str.replace(".", ",", regex=False)

# inspect
df.loc[:, [*cols_of_interest, "cb_ni_messed_up"]].head()

Unnamed: 0,u_company_name,u_year,cb_ni,cb_ni_messed_up
0,British American Tobacco PLC,2005,UNKNOWN,UNKNOWN
1,British American Tobacco PLC,2006,3713.506,"3713,506"
2,British American Tobacco PLC,2007,4226.559,4226.559
3,British American Tobacco PLC,2008,3591.888,3591.888
4,British American Tobacco PLC,2009,4386.107,4386.107


In [16]:
cols_of_interest = [
    "u_company_name",
    "u_year",
    "cb_ni",
    "cb_ni_messed_up",
    "cb_ni_corrected",
]
# Now repeat our procedure with the messed up column

# 1. Explicitly replace known placeholders for missingness with the appropriate special value (usually np.nan for floats)
df["cb_ni_corrected"] = df["cb_ni_messed_up"].replace("UNKNOWN", np.nan)

# 2. Note how many missing values are present before any type conversion
missing_before = df["cb_ni_corrected"].isna().sum()
print(f"Missing before replacing 'UNKNOWN' with np.nan: {df['cb_ni_messed_up'].isna().sum()}")
print(f"Missing after replacing 'UNKNOWN' with np.nan:  {missing_before}")
# 3. Prepare the column for type conversion (e.g. replace ',' with '.' and '.' with '' when number format is "German")
# => would actually be required here, but to prove the point we leave the data messed up

# 4. Perform the data type conversion
# df["cb_ni_corrected"] = df["cb_ni_corrected"].astype("float") would actually throw an error (which is good!).
# But to demonstrate, we use a more flexible (but also more "dangerous" option)
df["cb_ni_corrected"] = pd.to_numeric(df["cb_ni_corrected"], errors="coerce")

# 5. This can lead to additional NA entries if you have overlooked a problem:
#   check for this, comparing the number of missings with the number recorded in step 2.
missing_after = df["cb_ni_corrected"].isna().sum()
print(f"Missing after conversion to float:              {missing_after}")

print(f"Additional missing values:                      {missing_after - missing_before}")
# cols_of_interest already contains "cb_ni_messed_up" and "cb_ni_corrected"
df[cols_of_interest].head()

Missing before replacing 'UNKNOWN' with np.nan: 0
Missing after replacing 'UNKNOWN' with np.nan:  177
Missing after conversion to float:              217
Additional missing values:                      40


Unnamed: 0,u_company_name,u_year,cb_ni,cb_ni_messed_up,cb_ni_corrected
0,British American Tobacco PLC,2005,UNKNOWN,UNKNOWN,
1,British American Tobacco PLC,2006,3713.506,"3713,506",
2,British American Tobacco PLC,2007,4226.559,4226.559,4226.559
3,British American Tobacco PLC,2008,3591.888,3591.888,3591.888
4,British American Tobacco PLC,2009,4386.107,4386.107,4386.107
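
When the check reports additional missing values, it helps to look at the raw strings that failed to convert. A minimal sketch with made-up values (the names `raw_values`, `cleaned`, `converted` and `newly_missing` are chosen for this example only):

```python
import numpy as np
import pandas as pd

raw_values = pd.Series(["UNKNOWN", "3713,506", "4226.559"])  # made-up sample
cleaned = raw_values.replace("UNKNOWN", np.nan)
converted = pd.to_numeric(cleaned, errors="coerce")

# entries that were present before the conversion but are missing afterwards
newly_missing = cleaned.notna() & converted.isna()
print(cleaned[newly_missing].tolist())  # ['3713,506']
```

Inspecting these values usually shows directly which preparation is missing in step 3.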


In [17]:
# Now fix the problem in step 3, as you should

# 1. Explicitly replace known placeholders for missingness with the appropriate special value (usually np.nan for floats)
df["cb_ni_corrected"] = df["cb_ni_messed_up"].replace("UNKNOWN", np.nan)

# 2. Note how many missing values are present before any type conversion
missing_before = df["cb_ni_corrected"].isna().sum()
print(f"Missing before replacing 'UNKNOWN' with np.nan: {df['cb_ni_messed_up'].isna().sum()}")
print(f"Missing after replacing 'UNKNOWN' with np.nan:  {missing_before}")
# 3. Prepare the column for type conversion (e.g. replace ',' with '.' and '.' with '' when number format is "German")
df.loc[df["u_year"] == 2006, "cb_ni_corrected"] = df.loc[
    df["u_year"] == 2006, "cb_ni_corrected"
].str.replace(",", ".", regex=False)

# 4. Perform the data type conversion
# After the fix in step 3, df["cb_ni_corrected"].astype("float") should work as well.
# We keep pd.to_numeric() so that the result is directly comparable with the previous cell.
df["cb_ni_corrected"] = pd.to_numeric(df["cb_ni_corrected"], errors="coerce")

# 5. This can lead to additional NA entries if you have overlooked a problem:
#   check for this, comparing the number of missings with the number recorded in step 2.
missing_after = df["cb_ni_corrected"].isna().sum()
print(f"Missing after conversion to float:              {missing_after}")

print(f"Additional missing values:                      {missing_after - missing_before}")
df[cols_of_interest].head()

Missing before replacing 'UNKNOWN' with np.nan: 0
Missing after replacing 'UNKNOWN' with np.nan:  177
Missing after conversion to float:              177
Additional missing values:                      0


Unnamed: 0,u_company_name,u_year,cb_ni,cb_ni_messed_up,cb_ni_corrected
0,British American Tobacco PLC,2005,UNKNOWN,UNKNOWN,
1,British American Tobacco PLC,2006,3713.506,"3713,506",3713.506
2,British American Tobacco PLC,2007,4226.559,4226.559,4226.559
3,British American Tobacco PLC,2008,3591.888,3591.888,3591.888
4,British American Tobacco PLC,2009,4386.107,4386.107,4386.107


## *cb_at*: Comma as decimal separator, point as thousands separator. A number (-999) as placeholder for missing values.

We can follow the same strategy as before:

In [18]:
cols_of_interest = ["u_company_name", "u_year", "cb_at", "cb_at_corrected"]

# 1. Explicitly replace known placeholders for missingness with the appropriate special value (usually np.nan for floats)
df["cb_at_corrected"] = df["cb_at"].replace("-999", np.nan)

# 2. Note how many missing values are present before any type conversion
missing_before = df["cb_at_corrected"].isna().sum()
print(f"Missing before replacing '-999' with np.nan:    {df['cb_at'].isna().sum()}")
print(f"Missing after replacing '-999' with np.nan:     {missing_before}")
# 3. Prepare the column for type conversion ("German" format: first remove '.', then replace ',' with '.')
df["cb_at_corrected"] = (
    df["cb_at_corrected"].str.replace(".", "", regex=False).str.replace(",", ".", regex=False)
)

# 4. Perform the data type conversion
df["cb_at_corrected"] = df["cb_at_corrected"].astype("float")

# 5. This can lead to additional NA entries if you have overlooked a problem:
#   check for this, comparing the number of missings with the number recorded in step 2.
missing_after = df["cb_at_corrected"].isna().sum()
print(f"Missing after conversion to float:              {missing_after}")
print(f"Additional missing values:                      {missing_after - missing_before}")

df[cols_of_interest].head()

Missing before replacing '-999' with np.nan:    0
Missing after replacing '-999' with np.nan:     181
Missing after conversion to float:              181
Additional missing values:                      0


Unnamed: 0,u_company_name,u_year,cb_at,cb_at_corrected
0,British American Tobacco PLC,2005,"32.737,984",32737.984
1,British American Tobacco PLC,2006,-999,
2,British American Tobacco PLC,2007,-999,
3,British American Tobacco PLC,2008,"40.276,807",40276.807
4,British American Tobacco PLC,2009,"43.026,854",43026.854
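
As an alternative, `read_csv()` can parse this number format directly through its `decimal` and `thousands` parameters. These settings apply to every column that is read, so the sketch below restricts the read to the affected column with `usecols`. The in-memory rows and the names `raw` and `demo` are made up for illustration.

```python
import io

import pandas as pd

# made-up rows in the same format as cb_at
raw = "company\tcb_at\nAlpha\t32.737,984\nBeta\t-999\nGamma\t1.209,183\n"

demo = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    usecols=["cb_at"],    # only read the column that uses the "German" format
    decimal=",",          # comma as decimal separator
    thousands=".",        # point as thousands separator
    na_values=["-999"],   # numeric placeholder for missing values
)

print(demo["cb_at"].dtype)
print(demo["cb_at"].tolist())
```

The step-by-step approach from above remains useful because it makes the number of missing values visible at each stage.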


## *employees*: Thousands separator "." interpreted as decimal separator. The "N/A" strings are correctly identified as missing by `read_csv()`

Sometimes `read_csv()` silently loses information. The column *employees* is an example: all cells could be converted to `float` (empty cells are correctly turned into `np.nan`), but the values are actually whole numbers, and the thousands separators "." that exist in some rows of the raw data were misinterpreted as decimal separators!

To compare the raw data with the original result of `read_csv()`, we re-read the file but this time do not ask for automatic typing:

In [19]:
df_str = pd.read_csv("../../data/raw/financial_data_wrong_types.tsv", sep="\t", dtype="str")
df_str.info()
df_str.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 824 entries, 0 to 823
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   u_company_name_id      824 non-null    object
 1   u_year                 824 non-null    object
 2   u_company_name         824 non-null    object
 3   cb_naics               824 non-null    object
 4   u_iso3                 824 non-null    object
 5   u_fye                  824 non-null    object
 6   cb_cusip               824 non-null    object
 7   cb_at                  824 non-null    object
 8   cb_ni                  824 non-null    object
 9   cb_financial_industry  824 non-null    object
 10  cb_revt                648 non-null    object
 11  employees              680 non-null    object
dtypes: object(12)
memory usage: 77.4+ KB


Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,"32.737,984",UNKNOWN,No,,17.9
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,-999,3713.506,No,19317.672,5.274
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,-999,4226.559,No,,
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,"40.276,807",3591.888,No,17721.152,11.038
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,"43.026,854",4386.107,No,22970.074,8.961


In [20]:
# the US rows have no thousands separator:
df_str[df_str["u_iso3"] == "USA"].head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry,cb_revt,employees
15,13722,2005,Bob Evans Farms Inc.,311612,USA,2006-04-30,96761101,"1.209,183",54.774,No,1584.819,
16,13722,2006,Bob Evans Farms Inc.,311612,USA,2007-04-30,96761101,"1.196,962",60.542,No,1654.46,
17,13722,2007,Bob Evans Farms Inc.,311612,USA,2008-04-30,96761101,-999,64.876,No,1737.026,2345.0
18,13722,2008,Bob Evans Farms Inc.,311612,USA,2009-04-30,96761101,"1.147,648",UNKNOWN,No,1750.512,15882.0
19,13722,2009,Bob Evans Farms Inc.,311612,USA,2010-04-30,96761101,"1.109,157",UNKNOWN,No,1726.804,


In [21]:
# See how the original read_csv() changed the values for the non-USA vs. USA firms
df[df["u_iso3"] != "USA"].employees.describe()

count    502.000000
mean      32.316671
std      126.258353
min        1.000000
25%        5.910750
50%       11.152000
75%       16.012000
max      983.000000
Name: employees, dtype: float64

In [22]:
df[df["u_iso3"] == "USA"].employees.describe()

count      178.000000
mean      9722.601124
std       5417.972500
min        716.000000
25%       5290.750000
50%       9286.000000
75%      14308.750000
max      19994.000000
Name: employees, dtype: float64

There are two ways to address this kind of issue:
1. re-read the respective column as string (using the `dtype="str"` option) and then proceed as before with the steps 1-5
2. knowing that for Non-US-firms the *employees* are now denoted in "thousands", we could multiply by 1,000 and would have correct numbers again. Alternatively, we could divide USA-values by 1,000.

### Option 1. Re-read the respective column as string and proceed with the usual steps

Re-read the file with `read_csv()` using the `dtype="str"` option and then proceed as before with the steps 1-5.

We make use of the fact that we **have not changed the sorting and have not deleted any rows so far!**

Please note that the `int` type does not support NAs (see the links above). We could use a float instead. Floats can then also be converted to one of the special integer types in pandas, such as `pandas.Int32Dtype`.
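
The difference between the plain `int64` type and the nullable `Int64` type can be seen in a small sketch with made-up values (the names `counts` and `counts_nullable` are only used here):

```python
import numpy as np
import pandas as pd

counts = pd.Series([17900.0, np.nan, 5274.0])  # made-up values

# the plain integer type cannot represent the missing value
try:
    counts.astype("int64")
except ValueError as err:
    print(f"int64 failed: {err}")

# the nullable integer type keeps it as pd.NA
counts_nullable = counts.astype("Int64")
print(counts_nullable.dtype)         # Int64
print(counts_nullable.isna().sum())  # 1
```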

In [23]:
cols_of_interest = [
    "u_company_name",
    "u_year",
    "u_iso3",
    "employees",
    "employees_str",
    "employees_corrected",
]
df_str = pd.read_csv("../../data/raw/financial_data_wrong_types.tsv", sep="\t", dtype="str")

# fill the main df with the string-data from the newly read file (sorting and number of rows must be the same!)
df["employees_str"] = df_str.employees

# now proceed as usual
# 1. Explicitly replace known placeholders for missingness with the appropriate special value (usually np.nan for floats)
# (empty cells were already recognized as missing by read_csv(); the counts below show that no "-999" turns up here)
df["employees_corrected"] = df.employees_str.replace("-999", np.nan)

# 2. Note how many missing values are present before any type conversion
missing_before = df.employees_corrected.isna().sum()
print(f"Missing before replacing '-999' with np.nan:    {df.employees_str.isna().sum()}")
print(f"Missing after replacing '-999' with np.nan:     {missing_before}")
# 3. Prepare the column for type conversion (here: remove the thousands separator '.')
df["employees_corrected"] = df.employees_corrected.str.replace(".", "", regex=False)

# 4. Perform the data type conversion
# df["employees_corrected"] = df["employees_corrected"].astype("int") # this will throw an error because `int` does not support NaN
df["employees_corrected"] = df["employees_corrected"].astype("float").astype(pd.Int64Dtype())


# 5. This can lead to additional NA entries if you have overlooked a problem:
#   check for this, comparing the number of missings with the number recorded in step 2.
missing_after = df.employees_corrected.isna().sum()
print(f"Missing after conversion to Int64:              {missing_after}")

print(f"Additional missing values:                      {missing_after - missing_before}")
df[cols_of_interest].head()

Missing before replacing '-999' with np.nan:    144
Missing after replacing '-999' with np.nan:     144
Missing after conversion to Int64:              144
Additional missing values:                      0


Unnamed: 0,u_company_name,u_year,u_iso3,employees,employees_str,employees_corrected
0,British American Tobacco PLC,2005,GBR,17.9,17.9,17900.0
1,British American Tobacco PLC,2006,GBR,5.274,5.274,5274.0
2,British American Tobacco PLC,2007,GBR,,,
3,British American Tobacco PLC,2008,GBR,11.038,11.038,11038.0
4,British American Tobacco PLC,2009,GBR,8.961,8.961,8961.0


### Option 2. Multiply *employees* by 1,000 for non-USA firms

In [24]:
df["employees_corrected2"] = df.employees
df.loc[df["u_iso3"] != "USA", "employees_corrected2"] = (
    df.loc[df["u_iso3"] != "USA", "employees_corrected2"] * 1000
)
df.loc[df["u_iso3"] != "USA", [*cols_of_interest, "employees_corrected2"]].head()

Unnamed: 0,u_company_name,u_year,u_iso3,employees,employees_str,employees_corrected,employees_corrected2
0,British American Tobacco PLC,2005,GBR,17.9,17.9,17900.0,17900.0
1,British American Tobacco PLC,2006,GBR,5.274,5.274,5274.0,5274.0
2,British American Tobacco PLC,2007,GBR,,,,
3,British American Tobacco PLC,2008,GBR,11.038,11.038,11038.0,11038.0
4,British American Tobacco PLC,2009,GBR,8.961,8.961,8961.0,8961.0
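
Because the multiplication is done in floating point, the products are not guaranteed to be exact whole numbers. A possible way to obtain integers, sketched here with made-up values, is to round before converting to the nullable `Int64` type:

```python
import numpy as np
import pandas as pd

employees = pd.Series([17.9, 5.274, np.nan, 11.038])  # made-up values in "thousands"

# round() removes tiny floating-point deviations before the integer conversion
employees_int = (employees * 1000).round().astype("Int64")
print(employees_int.tolist())
```

Note that this option relies on knowing exactly which rows were misread, whereas Option 1 works from the raw strings.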


# Converting to boolean

The column *cb_financial_industry* contains the strings "Yes" and "No" and could therefore be transformed to boolean, which is more efficient and also easier to use.


In [25]:
# For example, selecting financial firms using the current `object` (string) data type:
cols_of_interest = ["u_company_name", "u_year", "cb_financial_industry"]
df.loc[(df["cb_financial_industry"] == "Yes") & (df["u_year"] == 2005), cols_of_interest].head()

Unnamed: 0,u_company_name,u_year,cb_financial_industry
86,Freddie Mac,2005,Yes
101,Allianz SE,2005,Yes
241,AvalonBay Communities Inc.,2005,Yes
279,"Axa, Paris",2005,Yes
332,Waddell & Reed Financial Inc.,2005,Yes


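Note that `astype("bool")` does not give the intended result for strings, because every non-empty string is converted to `True`. It only works as expected for numeric columns, where 0 becomes `False` and every other number becomes `True`: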

In [26]:
pd.Series(["Yes", "No"]).astype("bool")

0    True
1    True
dtype: bool

In [27]:
pd.Series(["True", "False"]).astype("bool")

0    True
1    True
dtype: bool

In [28]:
pd.Series([1, 0, 55]).astype("bool")

0     True
1    False
2     True
dtype: bool

Therefore, for converting "Yes"/"No", we have to use somewhat more elaborate functions. There are several possibilities:

In [29]:
# Using the map method with a dictionary
df["cb_financial_industry_bool"] = df["cb_financial_industry"].map({"Yes": True, "No": False})
df.loc[(df["u_year"] == 2005), [*cols_of_interest, "cb_financial_industry_bool"]].head(10)

Unnamed: 0,u_company_name,u_year,cb_financial_industry,cb_financial_industry_bool
0,British American Tobacco PLC,2005,No,False
15,Bob Evans Farms Inc.,2005,No,False
27,BP PLC,2005,No,False
42,BT Group PLC,2005,No,False
56,Glaxosmithkline PLC,2005,No,False
71,LVMH Moet Hennessy Louis Vuitton SE,2005,No,False
86,Freddie Mac,2005,Yes,True
101,Allianz SE,2005,Yes,True
115,EOG Resources Inc.,2005,No,False
130,BASF SE,2005,No,False


In [30]:
df["cb_financial_industry_bool"].dtype

dtype('bool')

The `map()` and `replace()` methods can also deal with a third value (e.g. "Not Available"):

In [31]:
# introducing some missing values
df["cb_financial_industry2"] = df["cb_financial_industry"]
df.loc[0, "cb_financial_industry2"] = "Not Available"
df.loc[
    (df["u_year"] == 2005),
    [*cols_of_interest, "cb_financial_industry2", "cb_financial_industry_bool"],
].head(10)

Unnamed: 0,u_company_name,u_year,cb_financial_industry,cb_financial_industry2,cb_financial_industry_bool
0,British American Tobacco PLC,2005,No,Not Available,False
15,Bob Evans Farms Inc.,2005,No,No,False
27,BP PLC,2005,No,No,False
42,BT Group PLC,2005,No,No,False
56,Glaxosmithkline PLC,2005,No,No,False
71,LVMH Moet Hennessy Louis Vuitton SE,2005,No,No,False
86,Freddie Mac,2005,Yes,Yes,True
101,Allianz SE,2005,Yes,Yes,True
115,EOG Resources Inc.,2005,No,No,False
130,BASF SE,2005,No,No,False


In [32]:
# tabulate the different unique values in the new column:
print(df.groupby("cb_financial_industry2").size())

cb_financial_industry2
No               692
Not Available      1
Yes              131
dtype: int64


In [33]:
# Using the replace method with a dictionary including the "Not Available" value
df["cb_financial_industry_bool"] = df["cb_financial_industry2"].replace(
    {"Yes": True, "No": False, "Not Available": np.nan}
)
df.loc[
    (df["u_year"] == 2005),
    [*cols_of_interest, "cb_financial_industry2", "cb_financial_industry_bool"],
].head(10)

Unnamed: 0,u_company_name,u_year,cb_financial_industry,cb_financial_industry2,cb_financial_industry_bool
0,British American Tobacco PLC,2005,No,Not Available,
15,Bob Evans Farms Inc.,2005,No,No,False
27,BP PLC,2005,No,No,False
42,BT Group PLC,2005,No,No,False
56,Glaxosmithkline PLC,2005,No,No,False
71,LVMH Moet Hennessy Louis Vuitton SE,2005,No,No,False
86,Freddie Mac,2005,Yes,Yes,True
101,Allianz SE,2005,Yes,Yes,True
115,EOG Resources Inc.,2005,No,No,False
130,BASF SE,2005,No,No,False


# Exercise

1. Load the first sheet of the Excel file "wdi_wrong_types.xlsx" into a pandas DataFrame.
2. What are the data types of the columns as determined by `read_excel()`?
3. The columns *CM_MKT_LCAP_CD*, *SP_DYN_LE00_IN*, and *SP_URB_TOTL_IN_ZS* should be numeric. Perform the necessary conversions using the steps from above.
4. The column *euro_area* should be boolean. Perform the necessary conversion using the steps from above.
5. Bonus: The column *incomegroup_and_id* actually contains two pieces of information, the countries' income group name and the respective ids. The two pieces are separated with a "|" character. Create two separate columns: a string column (`object`) for the income group name and an `int` column for the id. Hint: you might consider the method `pandas.Series.str.split()`, see [https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html).