# Introduction

Data often arrives with column names, index names, or other naming conventions that we are not satisfied with. In this section you'll learn how to use pandas functions to change the names of the offending entries to something better.

You'll also explore how to combine data from multiple DataFrames and/or Series.

# Renaming

The first function we'll introduce here is `rename()`, which lets you change index names and/or column names. For example, to change the `Salary` column in our dataset to `Renumeration`, we would do:

In [21]:

import pandas as pd
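# Use the fully qualified option name; a bare 'max_rows' can match more than one option in recent pandas versions.
pd.set_option('display.max_rows', 5)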
data = pd.read_csv("Salary_Data.csv", index_col=0)

In [22]:
data = data.rename(columns={'Salary': 'Renumeration'})

`rename()` lets you rename index _or_ column values by specifying an `index` or `columns` keyword parameter, respectively. It supports a variety of input formats, but usually a Python dictionary is the most convenient. Here is an example using it to rename some elements of the index.

In [23]:
data.head()

id,YearsExperience,Renumeration,Skill,Age,education,expense,savings
1,100.0,3900,c++,22,bachelor,31200.0,7800.0
2,110.0,390000,python,22,masters,31200.0,7800.0
3,2.0,37731,c++,24,bachelor,30184.8,7546.2
4,2.0,43525,,26,bachelor,34820.0,8705.0
5,2.2,39891,c,23,masters,31912.8,7978.2


In [24]:
# Only labels that exist in the index are renamed: there is no 0 here, so only 1 becomes 'S'.
data = data.rename(index={0: 'x', 1: 'S'})

In [25]:
data.head()

id,YearsExperience,Renumeration,Skill,Age,education,expense,savings
S,100.0,3900,c++,22,bachelor,31200.0,7800.0
2,110.0,390000,python,22,masters,31200.0,7800.0
3,2.0,37731,c++,24,bachelor,30184.8,7546.2
4,2.0,43525,,26,bachelor,34820.0,8705.0
5,2.2,39891,c,23,masters,31912.8,7978.2


You'll probably rename columns very often, but rename index values very rarely. For changes to the index, `set_index()` is usually more convenient.
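The renaming options described above can be sketched on a small stand-in table. The DataFrame below is hypothetical toy data, not the contents of `Salary_Data.csv`, and the variable names are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for Salary_Data.csv.
df = pd.DataFrame(
    {"YearsExperience": [1.1, 2.0, 2.2], "Salary": [39343, 37731, 39891]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Rename one column with a dictionary; unlisted columns are left alone.
renamed = df.rename(columns={"Salary": "Renumeration"})

# A function also works as the mapper: here every column name is lower-cased.
lowered = df.rename(columns=str.lower)

# Relabel a single index value; keys that are absent from the index are ignored.
relabeled = df.rename(index={1: "S", 99: "unused"})

# Replace the whole index with an existing column instead of relabeling values.
by_years = df.reset_index().set_index("YearsExperience")
```

Note that `rename()` returns a new DataFrame each time, which is why the cells above assign the result back to `data`.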



# Combining

When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways. Pandas has three core methods for doing this. In order of increasing complexity, these are `concat()`, `join()`, and `merge()`. Most of what `merge()` can do can also be done more simply with `join()`, so we will focus on the first two functions here and touch on `merge()` only briefly.

The simplest combining method is `concat()`. Given a list of elements, this function will smush those elements together along an axis.



In [26]:
copy = pd.read_csv("Salary_Data - Copy.csv")

pd.concat([data, copy])

,YearsExperience,Renumeration,Skill,Age,education,expense,savings,id,Salary
S,100.0,3900.0,c++,22,bachelor,31200.0,7800.0,,
2,110.0,390000.0,python,22,masters,31200.0,7800.0,,
...,...,...,...,...,...,...,...,...,...
28,10.3,,python,26,masters,97912.8,24478.2,29.0,122391.0
29,10.5,,c++,27,bachelor,97497.6,24374.4,30.0,121872.0
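The `NaN` entries in the output above come from columns that exist in only one of the two inputs (`Renumeration` in `data`, `id` and `Salary` in `copy`). A minimal sketch of that behavior, using two hypothetical toy frames rather than the real files:

```python
import pandas as pd

# Hypothetical frames: they share "Skill" but differ in the pay column name.
left = pd.DataFrame({"Skill": ["c++", "python"], "Renumeration": [3900, 390000]})
right = pd.DataFrame({"Skill": ["c"], "Salary": [39891]})

# Default axis=0 stacks rows; columns are unioned and gaps filled with NaN.
stacked = pd.concat([left, right])

# ignore_index=True discards the original row labels and renumbers from 0.
renumbered = pd.concat([left, right], ignore_index=True)

# Renaming first makes the columns line up, so no NaN values are introduced.
aligned = pd.concat(
    [left, right.rename(columns={"Salary": "Renumeration"})],
    ignore_index=True,
)
```

Whether to rename before concatenating depends on whether the two columns really hold the same quantity; here that is assumed purely for the sake of the example.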


The middlemost combiner in terms of complexity is `join()`. `join()` lets you combine different DataFrame objects which have an index in common. For example, to pair each row of `data` with the row of `copy` that has the same `id`, we could do the following (the `join()` call is left commented out, and the cell runs `pd.merge()` on the `id` key instead):

In [27]:

# result = data.join(copy, how='inner' )
result = pd.merge(data,copy,on='id')
result

,id,YearsExperience_x,Renumeration,Skill_x,Age_x,education_x,expense_x,savings_x,YearsExperience_y,Salary,Skill_y,Age_y,education_y,expense_y,savings_y
0,2,110.0,390000,python,22,masters,31200.0,7800.0,110.0,390000,python,22,masters,31200.0,7800.0
1,3,2.0,37731,c++,24,bachelor,30184.8,7546.2,2.0,37731,c++,24,bachelor,30184.8,7546.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27,29,10.3,122391,python,26,masters,97912.8,24478.2,10.3,122391,python,26,masters,97912.8,24478.2
28,30,10.5,121872,c++,27,bachelor,97497.6,24374.4,10.5,121872,c++,27,bachelor,97497.6,24374.4


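The `_x` and `_y` suffixes appear because `data` and `copy` share most of their column names, and `merge()` appends these suffixes by default to tell the two sides apart. With `join()`, the `lsuffix` and `rsuffix` parameters would be necessary for the same reason. If the column names didn't overlap (because, say, we'd renamed them beforehand) we wouldn't need suffixes at all.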
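To make the suffix behavior concrete, here is a small sketch comparing `join()` and `merge()` on two hypothetical frames that share a `Skill` column. The frames, the suffix strings, and the variable names are illustrative assumptions, not part of the dataset above.

```python
import pandas as pd

# Hypothetical frames indexed by id, with an overlapping "Skill" column.
a = pd.DataFrame({"Skill": ["c++", "python", "c"]}, index=pd.Index([1, 2, 3], name="id"))
b = pd.DataFrame({"Skill": ["c++", "python", "go"]}, index=pd.Index([2, 3, 4], name="id"))

# join() aligns on the index; overlapping column names need explicit suffixes.
joined = a.join(b, how="inner", lsuffix="_left", rsuffix="_right")

# merge() on the same key appends "_x" and "_y" by default.
merged = pd.merge(a.reset_index(), b.reset_index(), on="id")

# Without suffixes, join() refuses to guess how to name the duplicates.
try:
    a.join(b)
    raised = False
except ValueError:
    raised = True
```

Both results keep only ids 2 and 3, the labels present in both frames. `join()` does this because of `how="inner"`, and `merge()` because an inner join is its default.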