Modifying DataFrames
In the previous lesson, you learned what a DataFrame is and how to select subsets of data from one.

In this lesson, you'll learn how to modify an existing DataFrame. Some of the skills you'll learn include:

    Adding columns to a DataFrame
    Using lambda functions to calculate complex quantities
    Renaming columns

### Adding a Column I
Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

The DataFrame df contains information on products sold at a hardware store. Add a column to df called 'Sold in Bulk?', which indicates if the product is sold in bulk or individually.

In [1]:
import pandas as pd

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']

print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?
0           1  3 inch screw                  0.5   0.75           Yes
1           2   2 inch nail                  0.1   0.25           Yes
2           3        hammer                  3.0   5.50            No
3           4   screwdriver                  2.5   3.00            No
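
The list must have exactly one value per row. As a minimal sketch (the demo table and its values are made up for illustration and are not part of the lesson's data), pandas rejects a list whose length does not match:

```python
import pandas as pd

# Hypothetical two-row table, used only to illustrate the length rule.
demo = pd.DataFrame({'Product ID': [1, 2], 'Price': [0.75, 0.25]})

# A list with one value per row is accepted.
demo['Sold in Bulk?'] = ['Yes', 'No']

# A list with three values for two rows raises a ValueError.
try:
    demo['Is taxed?'] = ['Yes', 'Yes', 'Yes']
    error_raised = False
except ValueError:
    error_raised = True

print(error_raised)
```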


### Adding a Column II
We can also add a new column that is the same for all rows in the DataFrame.

Add a column to df called Is taxed?, which indicates whether or not to collect sales tax on the product. It should be 'Yes' for all rows.

In [2]:
df['Is taxed?'] = 'Yes'

print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?  \
0           1  3 inch screw                  0.5   0.75           Yes   
1           2   2 inch nail                  0.1   0.25           Yes   
2           3        hammer                  3.0   5.50            No   
3           4   screwdriver                  2.5   3.00            No   

  Is taxed?  
0       Yes  
1       Yes  
2       Yes  
3       Yes  


### Adding a Column III
Finally, you can add a new column by performing a calculation on the existing columns.

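Add a column to df called 'Revenue', which is equal to the difference between the Price and the Cost to Manufacture. (Strictly speaking, price minus cost is the profit on each item; the column keeps the name 'Revenue' here.)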

In [3]:
df['Revenue'] = df.Price - df['Cost to Manufacture']

print(df)

   Product ID   Description  Cost to Manufacture  Price Sold in Bulk?  \
0           1  3 inch screw                  0.5   0.75           Yes   
1           2   2 inch nail                  0.1   0.25           Yes   
2           3        hammer                  3.0   5.50            No   
3           4   screwdriver                  2.5   3.00            No   

  Is taxed?  Revenue  
0       Yes     0.25  
1       Yes     0.15  
2       Yes     2.50  
3       Yes     0.50  
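
Column arithmetic like this works row by row, and the same pattern applies to other operators. The sketch below assumes a small made-up products table and a hypothetical 'Price to Cost Ratio' column. It also shows why the lesson mixes two notations: dot access works for a name like Price, while a name containing spaces needs brackets.

```python
import pandas as pd

# Hypothetical table with the same column names as the lesson's df.
products = pd.DataFrame({
    'Description': ['hammer', 'screwdriver'],
    'Cost to Manufacture': [3.00, 2.50],
    'Price': [5.50, 3.00],
})

# Each row's price is divided by that row's cost.
# 'Cost to Manufacture' contains spaces, so it needs bracket notation.
products['Price to Cost Ratio'] = products.Price / products['Cost to Manufacture']

print(products)
```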


### Performing Column Operations
In the previous exercise, we learned how to add columns to a DataFrame.

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition.

We can use the apply function to apply a function to every value in a particular column. For example, this code overwrites the existing 'Name' column by applying the string method str.upper to every value in 'Name'.

In [4]:
# str.upper is a built-in string method, so no import is needed

# df['Name'] = df.Name.apply(str.upper)

Apply the function str.lower to all names in column 'Name' in df. Assign these new names to a new column of df called 'Lowercase Name'.

In [5]:
# df['Lowercase Name'] = df.Name.apply(str.lower)
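
The hardware-store df has no 'Name' column, so the cells above are left commented out. A runnable sketch of the same idea, assuming a small made-up table called people and Python 3's built-in str.upper and str.lower:

```python
import pandas as pd

# Hypothetical names; not part of the lesson's data.
people = pd.DataFrame({'Name': ['JOHN SMITH', 'Jane Doe', 'joe schmo']})

# New column: every name converted to lowercase.
people['Lowercase Name'] = people.Name.apply(str.lower)

# Overwrite the existing column: every name converted to uppercase.
people['Name'] = people.Name.apply(str.upper)

print(people)
```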

### Reviewing Lambda Function
A lambda function is a way of defining a function in a single line of code. Usually, we assign it to a variable.

For example, the following lambda function multiplies a number by 2 and then adds 3:

In [6]:
mylambda = lambda x: (x * 2) + 3
print(mylambda(5))

13


Lambda functions work with all types of variables, not just integers! Here is an example that takes in a string, assigns it to the temporary variable x, and then converts it into lowercase:

In [7]:
stringlambda = lambda x: x.lower()
print(stringlambda("Oh Hi Mark!"))

oh hi mark!


Create a lambda function mylambda that returns the first and last letters of a string, assuming the string is at least 2 characters long.

In [8]:
mylambda = lambda x: x[:1] + x[-1:]

print(mylambda('This is a string'))

Tg
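
The slices x[:1] and x[-1:] are slightly more forgiving than the indices x[0] and x[-1]. A small sketch (first_last_slice and first_last_index are names made up for this comparison) shows the difference on an empty string:

```python
# Hypothetical names, used only to compare the two spellings.
first_last_slice = lambda x: x[:1] + x[-1:]
first_last_index = lambda x: x[0] + x[-1]

# Both agree on a normal string.
print(first_last_slice('This is a string'))
print(first_last_index('This is a string'))

# Slicing an empty string returns an empty string...
print(repr(first_last_slice('')))

# ...while indexing an empty string raises IndexError.
try:
    first_last_index('')
    index_failed = False
except IndexError:
    index_failed = True
print(index_failed)
```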


### Reviewing Lambda Function: If Statements
We can make our lambdas more complex by using a modified form of an if statement.

Suppose we want to pay workers time-and-a-half for overtime (any work above 40 hours per week). The following function will convert the number of hours into time-and-a-half hours using an if statement:

In [9]:
def myfunction(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x

Below is a lambda function that does the same thing:

In [10]:
myfunction = lambda x: 40 + (x - 40) * 1.50 \
    if x > 40 else x
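
One way to confirm that the two definitions agree is to run both on the same inputs. In this sketch the names overtime_def and overtime_lambda are made up so that both versions can exist side by side:

```python
# The def version from above, under a hypothetical name.
def overtime_def(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x

# The lambda version from above, under a hypothetical name.
overtime_lambda = lambda x: 40 + (x - 40) * 1.50 \
    if x > 40 else x

# Compare below, at, and above the 40-hour threshold.
for hours in (30, 40, 45):
    print(hours, overtime_def(hours), overtime_lambda(hours))
```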

In general, the syntax for an if statement in a lambda function is:

In [11]:
# lambda x: [OUTCOME IF TRUE] \
#     if [CONDITIONAL] \
#     else [OUTCOME IF FALSE]

You are managing the webpage of a somewhat violent video game and you want to check that each user's age is 13 or greater when they visit the site.

Write a lambda function that takes an inputted age and returns 'Welcome to BattleCity!' if the user is 13 or older, or 'You must be over 13' if they are younger than 13. Your lambda function should be called mylambda.

In [12]:
mylambda = lambda x: 'Welcome to BattleCity!'\
    if x >= 13 else 'You must be over 13'
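
A short check of the boundary helps here, since the condition uses >= and an age of exactly 13 should be welcomed. The sample ages below are arbitrary:

```python
mylambda = lambda x: 'Welcome to BattleCity!' \
    if x >= 13 else 'You must be over 13'

# Try one age below, one at, and one above the cutoff.
for age in (12, 13, 30):
    print(age, '->', mylambda(age))
```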

### Applying a Lambda to a Column
In Pandas, we often use lambda functions to perform complex operations on columns. For example, suppose that we want to create a column containing the email provider for each email address.

We could use the following code with a lambda function:

In [13]:
# df['Email Provider'] = df.Email.apply(
#     lambda x: x.split('@')[-1]
#     )
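
The cell above is commented out because df has no 'Email' column at this point. A runnable sketch of the same pattern, assuming a made-up contacts table with invented addresses:

```python
import pandas as pd

# Hypothetical contact list; names and addresses are invented.
contacts = pd.DataFrame({
    'Name': ['JOHN SMITH', 'Jane Doe'],
    'Email': ['john.smith@gmail.com', 'jdoe@yahoo.com'],
})

# split('@') breaks the address at the @ sign; [-1] keeps the last piece.
contacts['Email Provider'] = contacts.Email.apply(
    lambda x: x.split('@')[-1]
)

print(contacts)
```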

Create a lambda function get_last_name which takes a string with someone's first and last name (e.g., John Smith), and returns just the last name (e.g., Smith).

The DataFrame df represents the hours worked by different employees over the course of the week. It contains the following columns:

    'name': The employee's name
    'hourly_wage': The employee's hourly wage
    'hours_worked': The number of hours worked this week

Use the lambda function get_last_name to create a new column last_name with only the employees' last names.

In [14]:
df = pd.read_csv('employees.csv')

# Split the full name on spaces and keep the final piece
get_last_name = lambda x: x.split(' ')[-1]

df['last_name'] = df.name.apply(get_last_name)

print(df)

       id               name  hourly_wage  hours_worked  last_name
0   10310      Lauren Durham           19            43     Durham
1   18656      Grace Sellers           17            40    Sellers
2   61254  Shirley Rasmussen           16            30  Rasmussen
3   16886        Brian Rojas           18            47      Rojas
4   89010    Samantha Mosley           11            38     Mosley
5   87246       Louis Guzman           14            39     Guzman
6   20578     Denise Mcclure           15            40    Mcclure
7   12869      James Raymond           15            32    Raymond
8   53461       Noah Collier           18            35    Collier
9   14746    Donna Frederick           20            41  Frederick
10  71127       Shirley Beck           14            32       Beck
11  92522    Christina Kelly            8            44      Kelly
12  22447        Brian Noble           11            39      Noble
13  61654          Randy Key           16            38       

### Applying a Lambda to a Row
We can also operate on multiple columns at once. If we use apply without specifying a single column and add the argument axis=1, the input to our lambda function will be an entire row, not a column. To access particular values of the row, we use the syntax row.column_name or row['column_name'].

If an employee worked for more than 40 hours, she needs to be paid overtime (1.5 times the normal hourly wage).

For instance, if an employee worked for 43 hours and made \$10/hour, she would receive \$400 for the first 40 hours that she worked, and an additional \$45 for the 3 hours of overtime, for a total of \$445.
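
The arithmetic in this example can be checked with a row-wise apply. This is only a sketch: the one-row table example and the helper pay are made up, and the helper uses min and max instead of the if statement that the exercise below asks for.

```python
import pandas as pd

# One made-up employee matching the worked example: 43 hours at $10/hour.
example = pd.DataFrame({'hourly_wage': [10], 'hours_worked': [43]})

def pay(row):
    # First 40 hours at the normal wage, any remaining hours at 1.5x.
    regular_hours = min(row['hours_worked'], 40)
    overtime_hours = max(row['hours_worked'] - 40, 0)
    return (regular_hours * row['hourly_wage']
            + overtime_hours * row['hourly_wage'] * 1.5)

# axis=1 passes each row (not each column) to pay.
example['total'] = example.apply(pay, axis=1)

print(example['total'].iloc[0])  # 445.0
```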

Create a lambda function total_earned that accepts an input row with keys hours_worked and hourly_wage and uses an if statement to calculate the total amount earned.

Use the lambda function total_earned and apply to add a column total_earned to df with the total amount earned by each employee.

In [15]:
df = pd.read_csv('employees.csv')

total_earned = lambda row: (40*row['hourly_wage']) + (row['hours_worked']-40)*(row['hourly_wage']*1.5) \
    if row['hours_worked'] > 40 \
    else ( row['hours_worked'] * row['hourly_wage'] )
  
df['total_earned'] = df.apply(total_earned, axis=1)

print(df)

       id               name  hourly_wage  hours_worked  total_earned
0   10310      Lauren Durham           19            43         845.5
1   18656      Grace Sellers           17            40         680.0
2   61254  Shirley Rasmussen           16            30         480.0
3   16886        Brian Rojas           18            47         909.0
4   89010    Samantha Mosley           11            38         418.0
5   87246       Louis Guzman           14            39         546.0
6   20578     Denise Mcclure           15            40         600.0
7   12869      James Raymond           15            32         480.0
8   53461       Noah Collier           18            35         630.0
9   14746    Donna Frederick           20            41         830.0
10  71127       Shirley Beck           14            32         448.0
11  92522    Christina Kelly            8            44         368.0
12  22447        Brian Noble           11            39         429.0
13  61654          R

### Renaming Columns
When we get our data from other sources, we often want to change the column names. For example, we might want all of the column names to follow variable name rules, so that we can use df.column_name (which tab-completes) rather than df['column_name'] (which takes up extra space).

You can change all of the column names at once by setting the .columns property to a different list. This is convenient when every column needs a new name, but be careful! The new names are assigned by position, so you can easily mislabel columns if you get the ordering wrong.
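
The ordering risk is easy to see on a small table. In this sketch the films table and its two rows are made up. A swapped order raises no error, while a list of the wrong length does:

```python
import pandas as pd

# Hypothetical two-column table.
films = pd.DataFrame({'id': [1, 2], 'name': ['Avatar', 'Jurassic World']})

# New names are matched to columns by position.
films.columns = ['ID', 'Title']

# Swapping the order is accepted silently; the labels are simply wrong.
mislabeled = films.copy()
mislabeled.columns = ['Title', 'ID']
print(mislabeled)

# A list with too many (or too few) names raises a ValueError.
try:
    films.columns = ['ID', 'Title', 'Year']
    length_error = False
except ValueError:
    length_error = True
print(length_error)
```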

The DataFrame df contains data about movies from IMDb.

We want to present this data to some film producers. Right now, our column names are in lower case, and are not very descriptive. Let's modify df using the .columns attribute to make the following changes to the columns: (not shown here)

In [16]:
df = pd.read_csv('imdb.csv')

df.columns = ['ID', 'Title', 'Category', 'Year Released', 'Rating']

print(df)

      ID                                              Title Category  \
0      1                                             Avatar   action   
1      2                                     Jurassic World   action   
2      3                                       The Avengers   action   
3      4                                    The Dark Knight   action   
4      5          Star Wars: Episode I - The Phantom Menace   action   
5      6                                          Star Wars   action   
6      7                            Avengers: Age of Ultron   action   
7      8                              The Dark Knight Rises   action   
8      9          Pirates of the Caribbean: Dead Mans Chest   action   
9     10                                         Iron Man 3   action   
10    11                                         Spider-Man   action   
11    12                Transformers: Revenge of the Fallen   action   
12    13       Star Wars: Episode III - Revenge of the Sith   ac

### Renaming Columns II
You can also rename individual columns by using the .rename method. Pass a dictionary like the one below to the columns keyword argument:

In [17]:
# {'old_column_name1': 'new_column_name1', 'old_column_name2': 'new_column_name2'}

Here's an example:

In [18]:
'''
df = pd.DataFrame({
    'name': ['John', 'Jane', 'Sue', 'Fred'],
    'age': [23, 29, 21, 18]
})
df.rename(columns={
    'name': 'First Name',
    'age': 'Age'},
    inplace=True)
'''

In [19]:
df = pd.read_csv('imdb.csv')

df.rename(columns={
    'name' : 'movie_title'},
    inplace=True)

print(df)

      id                                        movie_title   genre  year  \
0      1                                             Avatar  action  2009   
1      2                                     Jurassic World  action  2015   
2      3                                       The Avengers  action  2012   
3      4                                    The Dark Knight  action  2008   
4      5          Star Wars: Episode I - The Phantom Menace  action  1999   
5      6                                          Star Wars  action  1977   
6      7                            Avengers: Age of Ultron  action  2015   
7      8                              The Dark Knight Rises  action  2012   
8      9          Pirates of the Caribbean: Dead Mans Chest  action  2006   
9     10                                         Iron Man 3  action  2013   
10    11                                         Spider-Man  action  2002   
11    12                Transformers: Revenge of the Fallen  action  2009   
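
Two details of .rename are worth seeing on a small table: without inplace=True it returns a new DataFrame, and by default it ignores dictionary keys that match no column. A sketch, assuming a made-up people table:

```python
import pandas as pd

# Hypothetical table, similar to the example earlier in this section.
people = pd.DataFrame({'name': ['John', 'Jane'], 'age': [23, 29]})

# Without inplace=True, rename returns a renamed copy and leaves people alone.
renamed = people.rename(columns={'name': 'First Name'})
print(list(renamed.columns))
print(list(people.columns))

# A key that matches no column is ignored by default, so a typo
# in the old name produces no error and no change.
unchanged = people.rename(columns={'no_such_column': 'x'})
print(list(unchanged.columns))
```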

### Review
Great job! In this lesson, you learned how to modify an existing DataFrame. Some of the skills you've learned include:

    Adding columns to a DataFrame
    Using lambda functions to calculate complex quantities
    Renaming columns

Let's practice what you just learned!

Once more, you'll be the data analyst for ShoeFly.com, a fictional online shoe store.

More messy order data has been loaded into the variable orders. Examine the first 5 rows of the data using print and head.

Many of our customers want to buy vegan shoes (shoes made from materials that do not come from animals). Add a new column called shoe_source, which is vegan if the material is not leather and animal otherwise.

Our marketing department wants to send out an email to each customer. Using the columns last_name and gender, create a column called salutation which contains Dear Mr. <last_name> for men and Dear Ms. <last_name> for women.
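
These two steps can be tried on a tiny stand-in table before touching the real file. Everything in this sketch is an assumption: the rows are invented, the table is called orders_demo, and the material column is assumed to be named shoe_material.

```python
import pandas as pd

# Made-up rows standing in for the ShoeFly order data.
orders_demo = pd.DataFrame({
    'last_name': ['Lindsay', 'Acosta', 'Wood'],
    'gender': ['female', 'male', 'female'],
    'shoe_material': ['leather', 'fabric', 'faux-leather'],
})

# Only one column is needed here, so apply to that column directly.
orders_demo['shoe_source'] = orders_demo.shoe_material.apply(
    lambda m: 'animal' if m == 'leather' else 'vegan'
)

# Two columns are needed here, so apply to whole rows with axis=1.
orders_demo['salutation'] = orders_demo.apply(
    lambda row: ('Dear Mr. ' if row['gender'] == 'male' else 'Dear Ms. ')
    + row['last_name'],
    axis=1,
)

print(orders_demo.head(5))
```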

In [22]:
'''
orders = pd.read_csv('shoefly.csv')

orders['shoe_source'] = orders.apply(lambda row: 'vegan'
    if row['shoe_material'] != 'leather'
    else 'animal',
    axis=1
)

orders['salutation'] = orders.apply(lambda row: 'Dear Mr. ' + row['last_name']
    if row['gender'] == 'male'
    else 'Dear Ms. ' + row['last_name'],        
    axis=1
)
print(orders.head(5))
'''