# **Cleaning the data**

In [43]:
import pandas as pd

**Reading the raw data**<br>
I set the encoding, column separator, decimal mark and thousands separator to match the format of the raw file.

In [44]:
oljefondet = pd.read_csv('data/raw_data.csv', sep=';', decimal=',', thousands='.', encoding='utf-16')
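To illustrate what these parameters do, here is a minimal sketch on a small in-memory sample. The sample rows and column names are made up for the illustration and are not taken from the real file; only the `read_csv` arguments mirror the call above.

```python
import io
import pandas as pd

# Hypothetical sample in the same style as the raw file: semicolon-separated,
# '.' as thousands separator, ',' as decimal mark, encoded as UTF-16.
sample = "name;market_value;ownership\nAlpha;1.234.567;1,5\nBeta;89.000;0,25\n"
raw_bytes = io.BytesIO(sample.encode('utf-16'))

df = pd.read_csv(raw_bytes, sep=';', decimal=',', thousands='.', encoding='utf-16')

# market_value is parsed as an integer and ownership as a float
print(df.dtypes)
print(df)
```

Without `thousands='.'`, a value such as `1.234.567` would be read as text rather than as a number.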

**Removing unwanted columns**

In [45]:
oljefondet = oljefondet.drop(columns=['Incorporation Country', 'Market Value USD'])

**Standardizing remaining column names**

In [46]:
oljefondet.columns = ['region', 'country', 'name', 'industry', 'market_value', 'voting', 'ownership']
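Assigning a list to `columns` relies on the column order staying the same. As an alternative sketch, `rename` with an explicit mapping does not depend on order. The raw header names below are hypothetical, since the original headers are not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw headers; the real file's column names may differ.
raw = pd.DataFrame({
    'Region': ['Europe'],
    'Country': ['Norway'],
    'Market Value NOK': [100],
})

# errors='raise' makes a misspelled key fail loudly instead of
# silently leaving a column with its old name.
renamed = raw.rename(
    columns={
        'Region': 'region',
        'Country': 'country',
        'Market Value NOK': 'market_value',
    },
    errors='raise',
)
print(list(renamed.columns))
```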

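**Changing datatypes**<br>
Initially the dataframe uses at least 458.1 KB of memory (the `+` in the output below means the contents of the `object` columns are not fully counted).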

In [47]:
oljefondet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8374 entries, 0 to 8373
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   region        8374 non-null   object
 1   country       8374 non-null   object
 2   name          8374 non-null   object
 3   industry      8374 non-null   object
 4   market_value  8374 non-null   int64 
 5   voting        8374 non-null   object
 6   ownership     8374 non-null   object
dtypes: int64(1), object(6)
memory usage: 458.1+ KB


Changing datatypes to categories, strings and floats (keeping market value as an int).

In [49]:
oljefondet['region'] = oljefondet['region'].astype('category')
oljefondet['country'] = oljefondet['country'].astype('category')
oljefondet['name'] = oljefondet['name'].astype('string')
oljefondet['industry'] = oljefondet['industry'].astype('category')
oljefondet['voting'] = pd.to_numeric(oljefondet['voting'].str.replace('%', '').str.replace(',', '.'), errors='coerce')
oljefondet['ownership'] = pd.to_numeric(oljefondet['ownership'].str.replace('%', '').str.replace(',', '.'), errors='coerce')
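The two percentage columns are cleaned with the same steps, so the conversion could be wrapped in a small helper. This is only a sketch: `percent_to_float` is a hypothetical name, and the sample values are made up. Because `errors='coerce'` turns anything unparseable into `NaN`, it is worth counting the missing values afterwards.

```python
import pandas as pd

def percent_to_float(column: pd.Series) -> pd.Series:
    """Convert strings such as '1,25%' to floats such as 1.25."""
    cleaned = (
        column.str.replace('%', '', regex=False)
              .str.replace(',', '.', regex=False)
              .str.strip()
    )
    # Values that still cannot be parsed become NaN
    return pd.to_numeric(cleaned, errors='coerce')

sample = pd.Series(['1,25%', '0,5%', 'n/a'])
converted = percent_to_float(sample)

print(converted)
print('Values coerced to NaN:', converted.isna().sum())
```

The result is still expressed in percent (1.25 means 1.25%), matching the conversion above.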

After changing datatypes, the dataframe uses 289.6 KB of memory.

In [50]:
oljefondet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8374 entries, 0 to 8373
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   region        8374 non-null   category
 1   country       8374 non-null   category
 2   name          8374 non-null   string  
 3   industry      8374 non-null   category
 4   market_value  8374 non-null   int64   
 5   voting        8374 non-null   float64 
 6   ownership     8374 non-null   float64 
dtypes: category(3), float64(2), int64(1), string(1)
memory usage: 289.6 KB


This reduces the reported memory usage by 36.78%.

In [51]:
# Share of the original reported memory that remains after conversion (KB values from info())
remaining = 289.6 / 458.1
print(f'{(1 - remaining) * 100:.2f}%')

36.78%
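The figures printed by `info()` are estimates: for columns that hold Python objects, pandas by default counts only the references and not the strings themselves. A fuller measurement is available through `memory_usage(deep=True)`. The sketch below uses a made-up column, not the real data, to show the difference.

```python
import pandas as pd

# Hypothetical column with few distinct values, stored as Python objects
df = pd.DataFrame({'region': pd.Series(['Europe', 'Asia'] * 2000, dtype=object)})

shallow = df.memory_usage().sum()           # references only
deep = df.memory_usage(deep=True).sum()     # also counts the strings themselves
as_category = df.astype('category').memory_usage(deep=True).sum()

print(f'shallow: {shallow} bytes')
print(f'deep: {deep} bytes')
print(f'as category (deep): {as_category} bytes')
```

Applied to this notebook, the same measurement would be `oljefondet.memory_usage(deep=True).sum()` before and after the conversion.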


**Saving to a Parquet file**<br>
I use Parquet because it preserves the datatypes, and with them the memory savings, when the data is loaded again.

In [52]:
oljefondet.to_parquet('data/cleaned_data.parquet')
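A quick round trip can confirm that the datatypes survive. This is a sketch on a made-up dataframe written to a temporary file; it assumes a Parquet engine (`pyarrow` or `fastparquet`) is installed and skips the check otherwise.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample with the same kinds of columns as the cleaned data
df = pd.DataFrame({
    'region': pd.Series(['Europe', 'Asia'], dtype='category'),
    'market_value': [100, 200],
    'ownership': [1.5, 0.25],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample.parquet'  # hypothetical file name
    try:
        df.to_parquet(path)
        restored = pd.read_parquet(path)
    except ImportError:
        # No Parquet engine (pyarrow or fastparquet) is installed
        restored = None

if restored is not None:
    # The category column should come back as category, not object
    print(restored.dtypes)
```

A CSV file would not keep this information, so the type conversions would have to be repeated after every load.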