**Case Study:** Multiple variables are stored in the same column. 

Sometimes column headings encode more than one variable. Consider the file "tb.csv", which contains counts of confirmed tuberculosis cases by country, year, age, and sex. The data come from the World Health Organization's tuberculosis records; this is only a subset of a much larger (and messier) data set. In this file, each count column name (for example, "new_sp_m014") combines a sex code with an age group, so two variables are packed into one heading. 

In [40]:
import pandas as pd
file = "tb.csv"
data = pd.read_csv(file)

**Example:** Step 1 - rename the 'iso2' variable to something more meaningful. 

In [41]:
data.rename(columns = {'iso2':'country'}, inplace = True)
data

Unnamed: 0,country,year,new_sp_m014,new_sp_m1524,new_sp_m2534,new_sp_m3544,new_sp_m4554,new_sp_m5564,new_sp_m65,new_sp_f014,new_sp_f1524,new_sp_f2534,new_sp_f3544,new_sp_f4554,new_sp_f5564,new_sp_f65
0,AE,2000,2,4,4,6,5,12,10,3,16,1,3,0,0,4
1,AF,2000,52,228,183,149,129,94,80,93,414,565,339,205,99,36
2,AG,2000,0,0,0,0,0,0,1,1,1,1,0,0,0,0
3,AL,2000,2,19,21,14,24,19,16,3,11,10,8,8,5,11
4,AM,2000,2,152,130,131,63,26,21,1,24,27,24,8,8,4
5,AN,2000,0,0,1,2,0,0,0,0,0,1,0,0,1,0
6,AO,2000,186,999,1003,912,482,312,194,247,1142,1091,844,417,200,120
7,AR,2000,97,278,594,402,419,368,330,121,544,479,262,230,179,216
8,AT,2000,1,17,30,59,42,23,41,1,11,22,12,11,6,22
9,AU,2000,3,16,35,25,24,19,49,0,15,19,12,15,5,14


**Example:** Step 2 - melt the data set (for now with a combined column for sex and age). 

In [42]:
data = pd.melt(data, id_vars=["country","year"],     # identifier columns to keep in place
               var_name="sex_and_age",               # name of the new column holding the old column headings
               value_name="cases")                   # name of the new column holding the counts
data

Unnamed: 0,country,year,sex_and_age,cases
0,AE,2000,new_sp_m014,2
1,AF,2000,new_sp_m014,52
2,AG,2000,new_sp_m014,0
3,AL,2000,new_sp_m014,2
4,AM,2000,new_sp_m014,2
...,...,...,...,...
261,AN,2001,new_sp_f65,0
262,AO,2001,new_sp_f65,182
263,AR,2001,new_sp_f65,249
264,AT,2001,new_sp_f65,18
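
To make the reshaping easier to follow, here is a minimal sketch of the same `pd.melt` call on a tiny data frame. The countries and counts below are invented for illustration and are not taken from "tb.csv"; only the column layout mirrors the real file.

```python
import pandas as pd

# Hypothetical toy data in the same wide layout as tb.csv (values are invented).
wide = pd.DataFrame({
    "country": ["AA", "BB"],
    "year": [2000, 2000],
    "new_sp_m014": [1, 2],
    "new_sp_f014": [3, 4],
})

# Same call as in Step 2: keep country and year, stack everything else.
long = pd.melt(wide, id_vars=["country", "year"],
               var_name="sex_and_age", value_name="cases")

# Each input row appears once per melted column: 2 rows x 2 columns = 4 rows.
print(long.shape)
print(long)
```

The melted columns are stacked one after another, so all rows for the first heading come before the rows for the second.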


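**Example:** Step 3 - split the strings in the "sex_and_age" variable. We create three new Series: sex (with values 'f' and 'm'), lower (lower bound of the age group), and upper (upper bound of the age group). Note that the oldest group, "65", has no separate lower bound in its name, so lower is an empty string for those rows. 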

In [43]:
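# Remove the fixed "new_sp_" prefix (str.lstrip would strip a set of characters, not a prefix)
tmp_df = data["sex_and_age"].str.replace("new_sp_", "", regex=False)
# First character is the sex, the last two digits are the upper bound, anything in between is the lower bound
sex, lower, upper = tmp_df.str[0], tmp_df.str[1:-2], tmp_df.str[-2:]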
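
A note on removing a fixed prefix such as "new_sp_": `str.lstrip` treats its argument as a set of characters to strip, not as a literal prefix, so it can remove more than intended. The sketch below illustrates the difference; the code "new_sp_sx" is made up purely to expose the pitfall and does not occur in the data shown here.

```python
import pandas as pd

# Hypothetical codes; "new_sp_sx" is invented to show where lstrip goes wrong.
codes = pd.Series(["new_sp_m014", "new_sp_f65", "new_sp_sx"])

# lstrip removes any leading characters drawn from the set {n, e, w, _, s, p}.
stripped = codes.str.lstrip("new_sp_")

# Slicing by the prefix length removes exactly the seven prefix characters.
sliced = codes.str[len("new_sp_"):]

print(stripped.tolist())  # ['m014', 'f65', 'x']
print(sliced.tolist())    # ['m014', 'f65', 'sx']
```

For the codes in this data set both approaches give the same result, because the character after the prefix is always 'm' or 'f', neither of which is in the stripped set.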

**Example:** Step 4 - attach new columns to data frame. Remove the now obsolete 'sex_and_age' column. 

In [44]:
data['sex'] = sex
# Label groups as "lower-upper"; the oldest group has no lower bound in its name, so label it "65+"
data['age'] = (lower + "-" + upper).where(lower != "", upper + "+")
del data['sex_and_age']
data

Unnamed: 0,country,year,cases,sex,age
0,AE,2000,2,m,0-14
1,AF,2000,52,m,0-14
2,AG,2000,0,m,0-14
3,AL,2000,2,m,0-14
4,AM,2000,2,m,0-14
...,...,...,...,...,...
261,AN,2001,0,f,65+
262,AO,2001,182,f,65+
263,AR,2001,249,f,65+
264,AT,2001,18,f,65+
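
As an alternative to positional slicing, Steps 3 and 4 could also be done in one pass with a regular expression and `Series.str.extract`. This is only a sketch under stated assumptions: the sample codes are a hand-picked subset following the naming pattern of the column headings, the group names `sex`, `lower`, and `upper` are chosen here for illustration, and the "65+" label assumes the oldest group is open-ended.

```python
import pandas as pd

# Hypothetical sample of the melted "sex_and_age" column.
codes = pd.Series(["new_sp_m014", "new_sp_m1524", "new_sp_f5564", "new_sp_f65"])

# sex: one letter; upper: the final two digits; lower: whatever digits precede them (possibly none).
pattern = r"^new_sp_(?P<sex>[mf])(?P<lower>\d*)(?P<upper>\d{2})$"
parts = codes.str.extract(pattern)

# Rows without a lower bound (the oldest group) get an open-ended label.
has_lower = parts["lower"].fillna("") != ""
age = (parts["lower"] + "-" + parts["upper"]).where(has_lower, parts["upper"] + "+")

print(parts["sex"].tolist())
print(age.tolist())
```

Anchoring the pattern with `^` and `$` means that a code that does not follow the expected naming scheme yields missing values rather than a silently wrong split, which makes unexpected column names easier to spot.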
