**Introduction to Python**<br/>
Prof. Dr. Jan Kirenz <br/>
Hochschule der Medien Stuttgart

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-data" data-toc-modified-id="Import-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import data</a></span></li><li><span><a href="#Data-tidying" data-toc-modified-id="Data-tidying-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data tidying</a></span></li></ul></div>

In [1]:
import pandas as pd

For more information about the Pandas syntax, download the [Pandas code cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

### Import data

In [2]:
# Import data from GitHub (or from your local computer)
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/wage.csv")

### Data tidying

First, we want to get an overview of the data.

In [1]:
# show the head (first few observations in the df)
df.head(3)

In [4]:
# show metadata (take a look at the level of measurement)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 12 columns):
Unnamed: 0    3000 non-null int64
year          3000 non-null int64
age           3000 non-null int64
maritl        3000 non-null object
race          3000 non-null object
education     3000 non-null object
region        3000 non-null object
jobclass      3000 non-null object
health        3000 non-null object
health_ins    3000 non-null object
logwage       3000 non-null float64
wage          3000 non-null float64
dtypes: float64(2), int64(3), object(7)
memory usage: 281.3+ KB
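
As a complement to `df.info()`, the columns can be split by data type to see which ones are numeric and which are not. The following is a minimal sketch on a small toy DataFrame; the name `toy` and its values are made up for illustration and are not taken from the wage data.

```python
import pandas as pd

# Toy data (hypothetical values, not the real wage data set)
toy = pd.DataFrame({
    "age": [25, 40],
    "wage": [70.5, 120.0],
    "race": ["1. White", "2. Black"],
})

# Columns with a numeric dtype (int64, float64, ...)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# All remaining columns (here: text) are candidates for categorical conversion
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

The same two calls should work on `df`, where the non-numeric columns are the ones reported as `object` above.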


---
**Some notes on data types (level of measurement):** 

If we need to transform variables into a **numerical format**, we can transform the data with `pd.to_numeric` ([see the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html)).

If the data contains strings that cannot be parsed as numbers, we need to replace them with NaN (not a number); otherwise we get an error message. To do so, use `errors='coerce'`:

  * `pandas.to_numeric(arg, errors='coerce', downcast=None)`

  * `errors`: {'ignore', 'raise', 'coerce'}, default 'raise'
    * If 'raise', invalid parsing raises an exception.
    * If 'coerce', invalid parsing is set to NaN.
    * If 'ignore', invalid parsing returns the input unchanged (this option is deprecated since pandas 2.2).
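
The effect of `errors='coerce'` can be sketched on a small example. The series `raw` and its values are hypothetical and only serve to show the behaviour.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a non-numeric entry
raw = pd.Series(["10", "20.5", "unknown", "30"])

# errors="coerce" replaces values that cannot be parsed with NaN
clean = pd.to_numeric(raw, errors="coerce")

print(clean)

# Count how many values could not be converted
n_failed = clean.isna().sum()
print(n_failed)
```

Counting the NaN values afterwards is a simple way to check how much information was lost in the conversion.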
  
To change data into **categorical** format, you can use the following code:

df['variable'] = pd.Categorical(df['variable'])

If the data is ordinal, we use the pandas [CategoricalDtype](https://pandas.pydata.org/pandas-docs/stable/categorical.html) with `ordered=True`.

---

In [4]:
# show all columns in the data
df.columns

Index(['Unnamed: 0', 'year', 'age', 'maritl', 'race', 'education', 'region',
       'jobclass', 'health', 'health_ins', 'logwage', 'wage'],
      dtype='object')

In [6]:
# rename variable "education" to "edu"
df = df.rename(columns={"education": "edu"})

In [9]:
# check levels and frequency of edu
df['edu'].value_counts() 

2. HS Grad            971
4. College Grad       685
3. Some College       650
5. Advanced Degree    426
1. < HS Grad          268
Name: edu, dtype: int64

Convert `edu` to an ordinal variable with the pandas [CategoricalDtype](https://pandas.pydata.org/pandas-docs/stable/categorical.html):

In [None]:
from pandas.api.types import CategoricalDtype

In [10]:
# convert to ordinal variable
cat_edu = CategoricalDtype(categories=
                            ['1. < HS Grad', 
                             '2. HS Grad', 
                             '3. Some College', 
                             '4. College Grad', 
                             '5. Advanced Degree'],
                            ordered=True)

df['edu'] = df['edu'].astype(cat_edu)
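
To illustrate what the ordering provides, here is a minimal sketch on a toy series. The series `edu` below is a stand-in with made-up rows, not the real column; only the category labels are taken from the data above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy stand-in for the `edu` column (rows chosen for illustration)
edu = pd.Series(["2. HS Grad", "4. College Grad", "1. < HS Grad", "2. HS Grad"])

cat_edu = CategoricalDtype(
    categories=["1. < HS Grad", "2. HS Grad", "3. Some College",
                "4. College Grad", "5. Advanced Degree"],
    ordered=True,
)
edu = edu.astype(cat_edu)

# Ordered categories support min/max and comparisons against a category label
lowest, highest = edu.min(), edu.max()
n_at_least_hs = (edu >= "2. HS Grad").sum()

# Sorting follows the declared category order
sorted_edu = edu.sort_values().tolist()

print(lowest, "|", highest)
print(n_at_least_hs)
print(sorted_edu)
```

Without `ordered=True`, the comparison and `min`/`max` calls would raise an error, because an unordered categorical has no notion of "greater than".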

Now convert `race` to a categorical variable.

In [15]:
# convert to categorical variable 
df['race'] = pd.Categorical(df['race'])
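
A nominal categorical stores each distinct label once and represents the rows as integer codes. The sketch below shows this on a toy series; the values are assumed for illustration and need not match the labels in the actual `race` column.

```python
import pandas as pd

# Toy stand-in for the `race` column (hypothetical values)
race = pd.Series(["1. White", "2. Black", "1. White", "3. Asian"])
race = pd.Series(pd.Categorical(race))

# The distinct labels, inferred from the data in sorted order
labels = race.cat.categories.tolist()

# Integer code per row, pointing into `labels`
codes = race.cat.codes.tolist()

print(race.dtype)
print(labels)
print(codes)
print(race.cat.ordered)  # no order is implied for a nominal variable
```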

Take a look at the metadata: what happened to `edu` and `race`?

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 12 columns):
Unnamed: 0    3000 non-null int64
year          3000 non-null int64
age           3000 non-null int64
maritl        3000 non-null object
race          3000 non-null object
edu           3000 non-null category
region        3000 non-null object
jobclass      3000 non-null object
health        3000 non-null object
health_ins    3000 non-null object
logwage       3000 non-null float64
wage          3000 non-null float64
dtypes: category(1), float64(2), int64(3), object(6)
memory usage: 261.0+ KB
