**Introduction to Python**<br/>
Prof. Dr. Jan Kirenz <br/>
Hochschule der Medien Stuttgart

In [1]:
import pandas as pd

For more information about pandas syntax, download the [Pandas code cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

### Import data

In [2]:
# Import data from GitHub (or from your local computer)
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/datasets/master/wage.csv")

### Data tidying

First of all, we want to get an overview of the data.

In [3]:
# show the head (first few observations in the df)
df.head(3)

| | Unnamed: 0 | year | age | maritl | race | education | region | jobclass | health | health_ins | logwage | wage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 231655 | 2006 | 18 | 1. Never Married | 1. White | 1. < HS Grad | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 2. No | 4.318063 | 75.043154 |
| 1 | 86582 | 2004 | 24 | 1. Never Married | 1. White | 4. College Grad | 2. Middle Atlantic | 2. Information | 2. >=Very Good | 2. No | 4.255273 | 70.47602 |
| 2 | 161300 | 2003 | 45 | 2. Married | 1. White | 3. Some College | 2. Middle Atlantic | 1. Industrial | 1. <=Good | 1. Yes | 4.875061 | 130.982177 |


In [4]:
# show metadata (take a look at the level of measurement)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  3000 non-null   int64  
 1   year        3000 non-null   int64  
 2   age         3000 non-null   int64  
 3   maritl      3000 non-null   object 
 4   race        3000 non-null   object 
 5   education   3000 non-null   object 
 6   region      3000 non-null   object 
 7   jobclass    3000 non-null   object 
 8   health      3000 non-null   object 
 9   health_ins  3000 non-null   object 
 10  logwage     3000 non-null   float64
 11  wage        3000 non-null   float64
dtypes: float64(2), int64(3), object(7)
memory usage: 281.4+ KB


---
**Some notes on data types (level of measurement):** 

If we need to transform variables into a **numerical format**, we can transform the data with `pd.to_numeric` ([see the pandas documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html)).

If the data contains strings that cannot be parsed as numbers, the default setting raises an error. To replace such values with `NaN` (not a number) instead, use `errors='coerce'`:

  * `pandas.to_numeric(arg, errors='coerce', downcast=None)`

  * `errors`: {`'ignore'`, `'raise'`, `'coerce'`}, default `'raise'`
  * If `'raise'`, invalid parsing raises an exception.
  * If `'coerce'`, invalid parsing is set to `NaN`.
  * If `'ignore'`, invalid parsing returns the input unchanged (this option is deprecated since pandas 2.2).
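
As a minimal sketch of the `errors='coerce'` behaviour described above, the following example converts a small toy column. The values in `raw` are hypothetical and are not taken from the wage data.

```python
import pandas as pd

# Toy column (hypothetical values): numbers stored as strings plus one invalid entry
raw = pd.Series(["10", "20.5", "unknown", "30"])

# errors='coerce' sets every value that cannot be parsed to NaN
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print(clean.dtype)         # float64, because NaN requires a float column
print(clean.isna().sum())  # number of values that could not be parsed
```

Counting the `NaN` values after the conversion is a quick way to see how many entries were not valid numbers.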
  
To change data into **categorical** format, you can use the following code:

`df['variable'] = pd.Categorical(df['variable'])`

If the data is ordinal, we use the pandas [CategoricalDtype](https://pandas.pydata.org/pandas-docs/stable/categorical.html).

---

In [5]:
# show all columns in the data
df.columns

Index(['Unnamed: 0', 'year', 'age', 'maritl', 'race', 'education', 'region',
       'jobclass', 'health', 'health_ins', 'logwage', 'wage'],
      dtype='object')

In [6]:
# rename variable "education" to "edu"
df = df.rename(columns={"education": "edu"})

In [7]:
# check levels and frequency of edu
df['edu'].value_counts() 

2. HS Grad            971
4. College Grad       685
3. Some College       650
5. Advanced Degree    426
1. < HS Grad          268
Name: edu, dtype: int64

Convert `edu` to an ordinal variable with the pandas [CategoricalDtype](https://pandas.pydata.org/pandas-docs/stable/categorical.html):

In [8]:
from pandas.api.types import CategoricalDtype

In [9]:
# convert to ordinal variable
cat_edu = CategoricalDtype(categories=
                            ['1. < HS Grad', 
                             '2. HS Grad', 
                             '3. Some College', 
                             '4. College Grad', 
                             '5. Advanced Degree'],
                            ordered=True)

df['edu'] = df['edu'].astype(cat_edu)
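
To illustrate what the `ordered=True` setting makes possible, here is a small sketch on a hypothetical four-row DataFrame called `toy`. It reuses the education levels from the wage data but does not load the data itself.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same education levels as in the wage data, declared in ascending order
levels = ['1. < HS Grad', '2. HS Grad', '3. Some College',
          '4. College Grad', '5. Advanced Degree']
cat_edu = CategoricalDtype(categories=levels, ordered=True)

# Hypothetical sample, only for illustration
toy = pd.DataFrame({"edu": ['2. HS Grad', '5. Advanced Degree',
                            '1. < HS Grad', '4. College Grad']})
toy["edu"] = toy["edu"].astype(cat_edu)

# Comparisons follow the declared category order
at_least_college = toy[toy["edu"] >= '4. College Grad']

# min() and max() are only defined for ordered categoricals
lowest = toy["edu"].min()
highest = toy["edu"].max()

# Sorting uses the category order as well
sorted_edu = toy["edu"].sort_values().tolist()

print(at_least_college)
print(lowest, "|", highest)
print(sorted_edu)
```

With an unordered categorical, the comparison and the calls to `min()` and `max()` would raise a `TypeError`, which is why the order is declared explicitly for ordinal data.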

Now convert `race` to a categorical variable:

In [10]:
# convert to categorical variable 
df['race'] = pd.Categorical(df['race'])

Take a look at the metadata (what happened to `edu` and `race`?):

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Unnamed: 0  3000 non-null   int64   
 1   year        3000 non-null   int64   
 2   age         3000 non-null   int64   
 3   maritl      3000 non-null   object  
 4   race        3000 non-null   category
 5   edu         3000 non-null   category
 6   region      3000 non-null   object  
 7   jobclass    3000 non-null   object  
 8   health      3000 non-null   object  
 9   health_ins  3000 non-null   object  
 10  logwage     3000 non-null   float64 
 11  wage        3000 non-null   float64 
dtypes: category(2), float64(2), int64(3), object(5)
memory usage: 240.8+ KB
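
The output above shows `edu` and `race` with the dtype `category`, and the reported memory usage drops from 281.4+ KB to 240.8+ KB. The following sketch shows one way to inspect a categorical column and to compare its memory footprint before and after conversion. The DataFrame `toy` and its labels (`group A`, `group B`, `group C`) are hypothetical stand-ins, not values from the wage data.

```python
import pandas as pd

# Hypothetical toy column with three repeated labels (1000 rows)
toy = pd.DataFrame({"race": ["group A", "group B", "group A", "group C"] * 250})

# Memory used by the column while it still holds plain strings
as_object = toy["race"].memory_usage(deep=True)

# Convert to an (unordered) categorical variable
toy["race"] = pd.Categorical(toy["race"])
as_category = toy["race"].memory_usage(deep=True)

# The .cat accessor exposes the categories and the ordering flag
print(toy["race"].cat.categories.tolist())
print(toy["race"].cat.ordered)

# Categories are stored once, the rows only hold small integer codes
print(as_object, as_category)
```

The saving depends on how often each label repeats, so the exact numbers will differ between datasets.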
