# pandas Skill Drills

At the beginning of class, we told you about DataFrames. They have named columns and one datatype per column. The [pandas](https://pandas.pydata.org/) library is our main support for DataFrames in Python. This skill drill will teach you standard operations on DataFrames with pandas.

*Note:* In exercises that load data, you'll notice we're loading the dataset in every exercise. This isn't necessary when you're working with your own data later on in this class, but for now we want to make sure you're starting each exercise with a clean version of the data so any errors don't compound.

*Note:* In a Jupyter notebook like this, the value in the last line of a code cell will be printed, even without a `print()` statement. We have you use `print()` statements in every exercise because it will allow you to print multiple values from a cell when you need to.

## Table of Contents

A. [DataFrames](#A.-DataFrames)

B. [Basic indexing](#B.-Basic-indexing)

C. [Advanced indexing](#C.-Advanced-indexing)

D. [Modifying DataFrames](#D.-Modifying-DataFrames)

E. [Combining DataFrames](#E.-Combining-DataFrames)

Whenever you use pandas, you first need to import it with the following line of code. We'll also import NumPy because we'll be using it, too. Run the following block of code to get started:

In [2]:
import pandas as pd
import numpy as np

---
## A. DataFrames

The main datatypes when using pandas are the [*Series*](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) and the [*DataFrame*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

A Series is essentially a wrapper for a one-dimensional NumPy array. You can perform many operations on a Series just like you would with a NumPy array. The main difference is that it has an extra label/index for accessing each element, which we will learn about later on.
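As a minimal sketch of that idea (the values and the labels `"a"`, `"b"`, `"c"` are made up for illustration), the following builds a Series from a NumPy array, uses it like an array, and then looks up one element by its label:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: three measurements with string labels.
values = np.array([1.5, 2.5, 4.0])
s_example = pd.Series(values, index=["a", "b", "c"])

# NumPy-style operations work on the whole Series at once.
total = s_example.sum()

# Unlike a plain array, each element also carries a label.
by_label = s_example["b"]

# The underlying values are still available as a NumPy array.
as_array = s_example.to_numpy()

print(s_example)
print(total, by_label)
```

The left-hand column in the printed Series shows the labels, and the right-hand column shows the values.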

### Goal: Create a pandas Series object.

<span style="color:blue">**Exercise A1.**</span>

**Input**: a Python list

```
['a', 'b', 'c', 'd', 'e', 'f']
```
 
**Output**: Convert the list to a pandas Series using the [pd.Series() function](https://pandas.pydata.org/docs/reference/api/pandas.Series.html). Print the Series. The numbers that appear in the left column are each row's label.

```
0    a
1    b
2    c
3    d
4    e
5    f
dtype: object
```

In [3]:
# Given code
l = ['a', 'b', 'c', 'd', 'e', 'f']

# your code here
print(pd.Series(data=l))

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object


----
### Goal: Operate on a pandas Series all at once.

<span style="color:blue">**Exercise A2.**</span>

**Input**: a pandas Series of integers:

```
0     0
1    10
2    20
3    30
4    40
dtype: int64
```
 
**Output**: Create a new Series object by multiplying the first Series by 2. You can perform arithmetic directly on a Series just like we did when we used broadcasting on NumPy arrays. Print this new Series.

```
0     0
1    20
2    40
3    60
4    80
dtype: int64
```
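To illustrate element-wise arithmetic on different data (the `prices` values are hypothetical), this sketch adds a scalar to a Series and adds a Series to itself. In both cases a new Series is returned and the original is left unchanged:

```python
import pandas as pd

prices = pd.Series([1.0, 2.5, 4.0])  # hypothetical values

# A scalar is applied to every element, as with NumPy broadcasting.
shifted = prices + 10

# Two Series are combined element by element, matched on their labels.
combined = prices + prices

print(shifted)
print(combined)
```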

In [4]:
# Given code
s = pd.Series([0, 10, 20, 30, 40])

# your code here
s2 = s * 2

print(s2)

0     0
1    20
2    40
3    60
4    80
dtype: int64


----
### Goal: Create a pandas DataFrame object.

A DataFrame is a two-dimensional table of data. A DataFrame has rows and columns, where each column is a Series. Columns can have different datatypes from each other.
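The sketch below, built on invented tree data, shows what "each column is a Series" and "columns can have different datatypes" look like in practice. The column names are assumptions chosen only for this example:

```python
import pandas as pd

# Hypothetical data: one text column, one integer column, one float column.
df_example = pd.DataFrame({
    "name": ["oak", "pine", "birch"],
    "age_years": [120, 45, 30],
    "height_m": [21.5, 18.0, 12.3],
})

# Selecting a single column gives back a Series.
column = df_example["age_years"]
print(type(column))

# Each column has its own dtype.
print(df_example.dtypes)
```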

<span style="color:blue">**Exercise A3.**</span>

**Input**: a NumPy array

```
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
```
 
**Output**: Convert the NumPy array to a pandas DataFrame with the [`pd.DataFrame()` function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). Use the `columns` argument to name the first column 'x' and the second column 'y'. Print the resulting DataFrame.

```
    x   y
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
5  10  11
```

In [5]:
# Given code
A = np.arange(12).reshape((6, 2))

# your code here

print(pd.DataFrame(data = A, columns = ['x', 'y']))


    x   y
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
5  10  11


---
### Goal: Load a CSV file into a DataFrame. Use `.head()` to see the first few rows of a DataFrame.

<span style="color:blue">**Exercise A4.**</span>

**Input**: None
 
**Output**: Create a new DataFrame by reading in the `forest-fires.csv` file. Use the DataFrame's [`.head()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) to view the first five rows.

```
   Temperature  Area
0         18.0  0.36
1         21.7  0.43
2         21.9  0.47
3         23.3  0.55
4         21.2  0.61
```
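If you want to experiment with `pd.read_csv()` and `.head()` without a file on disk, one option is to read from an in-memory string. This is only a sketch: the CSV text below is invented and is not the contents of `forest-fires.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV contents, standing in for a file on disk.
csv_text = """Temperature,Area
10.0,1.5
12.5,2.0
15.0,0.5
17.5,3.0
20.0,4.5
22.5,6.0
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
fires = pd.read_csv(io.StringIO(csv_text))

print(fires.shape)      # (number of rows, number of columns)
print(fires.head())     # first five rows by default
print(fires.head(n=2))  # first two rows
```

The first line of the CSV text becomes the column names, and pandas assigns the default integer labels to the rows.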

In [24]:
# your code here

df=pd.read_csv('forest-fires.csv')

print(df.head(n=5)) 



   Temperature  Area
0         18.0  0.36
1         21.7  0.43
2         21.9  0.47
3         23.3  0.55
4         21.2  0.61


---
## B. Basic indexing

### Goal: Access rows using indices and labels. Understand when to use `.iloc` vs `.loc`.

Each row of a DataFrame has a label that identifies it. Together, the row labels form the DataFrame's index, which is like a special Series. If you don't specify labels when creating the DataFrame, they default to sequential integers starting at 0. But DataFrames can have other labels as well, such as strings.

There are two main ways of accessing rows of a DataFrame:
- `df.iloc` accepts integer positions. For example, `df.iloc[10]` returns the row at position 10, counting from 0. You're familiar with this kind of indexing from NumPy arrays.
- `df.loc` accepts labels and boolean arrays. In subsequent exercises, we'll see how `df.loc` and `df.iloc` can yield different results. **We will most often be using `df.loc` in this class.**
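The difference between the two accessors is easiest to see when the labels are integers that do not match the row positions. The sketch below uses an invented DataFrame with labels 10, 20, and 30:

```python
import pandas as pd

# Hypothetical DataFrame whose integer labels are 10, 20, 30 rather than 0, 1, 2.
trees = pd.DataFrame(
    {"Tree": ["oak", "pine", "birch"]},
    index=[10, 20, 30],
)

by_position = trees.iloc[1]  # second row, counting positions from 0
by_label = trees.loc[20]     # row whose label is 20

print(by_position["Tree"])
print(by_label["Tree"])

# There is no row labeled 1, so .loc[1] raises a KeyError here,
# even though .iloc[1] works.
try:
    trees.loc[1]
    missing_label_error = None
except KeyError as err:
    missing_label_error = type(err).__name__
print(missing_label_error)
```

Both lookups return the same row here, but they get there in different ways: one counts rows, the other matches a label.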

<span style="color:blue">**Exercise B1.**</span> Using integer labels to select rows.

**Input**: A DataFrame of city populations indexed by integers:

```
df = 
       City  Population
0    Ithaca       32108
1  New York     8804190
2    Boston      675647
```
 
**Output**: 

Part a: Use [`df.iloc` indexing](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) to print the row with index 1.

```
City          New York
Population     8804190
Name: 1, dtype: object
```

Part b: Use [`df.loc` indexing](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) to print the row with label 1. This will work because pandas defaults to integer labels when labels are not specified.

```
City          New York
Population     8804190
Name: 1, dtype: object
```

In [31]:
# Given code
df = pd.DataFrame({'City': ['Ithaca', 'New York', 'Boston'], 'Population': [32108, 8804190, 675647]})


# part a. your code here
print(df.iloc[1])

# part b. your code here

print(df.loc[1])

City          New York
Population     8804190
Name: 1, dtype: object
City          New York
Population     8804190
Name: 1, dtype: object


---
<span style="color:blue">**Exercise B2.**</span> Using string labels to select rows.

**Input**: A DataFrame of city populations indexed by city name. Note there are no integers printed on the left of each row now.

```
df = 
          Population
Ithaca         32108
New York     8804190
Boston        675647
```
 
**Output**: 

Part a: Use [`df.iloc` indexing](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) to print the row with index 1.

```
Population    8804190
Name: New York, dtype: int64
```

Part b: Use [`df.loc` indexing](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) to print the row with label 'New York'. Note that trying `df.loc[1]`, as in the previous exercise, will not work because here the rows are indexed by city name instead of an integer.

```
Population    8804190
Name: New York, dtype: int64
```

In [37]:
# Given code
df = pd.DataFrame({'Population': [32108, 8804190, 675647]}, index=['Ithaca', 'New York', 'Boston'])


# part a. your code here
print(df.iloc[1])

# part b. your code here
print(df.loc['New York'])


Population    8804190
Name: New York, dtype: int64
Population    8804190
Name: New York, dtype: int64


---
<span style="color:blue">**Exercise B3.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Use `df.iloc` to print the row with index 100.

```
Temperature    15.40
Area           10.13
Name: 100, dtype: float64
```

In [38]:
# Given code
df = pd.read_csv('forest-fires.csv')

# your code here
print(df.iloc[100])

Temperature    15.40
Area           10.13
Name: 100, dtype: float64


---
### Goal: Use slicing to access multiple rows of a DataFrame.

<span style="color:blue">**Exercise B4.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Use `df.iloc` and slicing to print all rows from index 10 to 19 (inclusive).

```
    Temperature  Area
10         17.4  1.07
11         23.7  1.12
12         23.2  1.19
13         24.8  1.36
14         24.6  1.43
15         20.1  1.46
16         29.6  1.46
17         16.4  1.56
18         28.6  1.61
19         18.4  1.63
```
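Slices behave differently for the two accessors, which is worth checking on a small table. In this sketch (the `numbers` DataFrame is made up), `.iloc` excludes the stop position as Python lists do, while `.loc` includes the stop label:

```python
import pandas as pd

numbers = pd.DataFrame({"value": [0, 10, 20, 30, 40, 50]})  # hypothetical data

# .iloc slices by position and excludes the stop position.
by_position = numbers.iloc[2:4]  # positions 2 and 3

# .loc slices by label and includes the stop label.
by_label = numbers.loc[2:4]      # labels 2, 3, and 4

print(by_position)
print(by_label)
```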

In [61]:
# Given code
df = pd.read_csv('forest-fires.csv')

# your code here
print(df.iloc[10:20])


    Temperature  Area
10         17.4  1.07
11         23.7  1.12
12         23.2  1.19
13         24.8  1.36
14         24.6  1.43
15         20.1  1.46
16         29.6  1.46
17         16.4  1.56
18         28.6  1.61
19         18.4  1.63


----
## C. Advanced indexing

### Goal: Access an entire column of a DataFrame.

<span style="color:blue">**Exercise C1.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Use indexing to create a new series variable that contains the DataFrame's `Temperature` column. Print this series.

```
0      18.0
1      21.7
2      21.9
3      23.3
4      21.2
       ... 
263    21.1
264    18.2
265    27.8
266    21.9
267    21.2
Name: Temperature, Length: 268, dtype: float64
```
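There is more than one way to pull out a column. As a sketch on invented data, the bracket shorthand and the `.loc` form below return the same Series, while passing a list of names returns a DataFrame instead:

```python
import pandas as pd

# Hypothetical data for illustration.
trees = pd.DataFrame({"height_m": [21.5, 18.0], "age_years": [120, 45]})

heights_a = trees["height_m"]         # bracket shorthand
heights_b = trees.loc[:, "height_m"]  # all rows, one column

# A list of column names returns a DataFrame rather than a Series.
subset = trees[["height_m"]]

print(heights_a.equals(heights_b))
print(type(subset))
```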

In [62]:
# Given code
df = pd.read_csv('forest-fires.csv')

# your code here

print(df.loc[:,"Temperature"])


0      18.0
1      21.7
2      21.9
3      23.3
4      21.2
       ... 
263    21.1
264    18.2
265    27.8
266    21.9
267    21.2
Name: Temperature, Length: 268, dtype: float64


----
<span style="color:blue">**Exercise C2.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Use indexing to create a new series variable that contains the DataFrame's `Area` column. Print this series.

```
0       0.36
1       0.43
2       0.47
3       0.55
4       0.61
       ...  
263     2.17
264     0.43
265     6.44
266    54.29
267    11.16
Name: Area, Length: 268, dtype: float64
```

In [63]:
# Given code
df = pd.read_csv('forest-fires.csv')

# your code here
print(df.loc[:,"Area"])


0       0.36
1       0.43
2       0.47
3       0.55
4       0.61
       ...  
263     2.17
264     0.43
265     6.44
266    54.29
267    11.16
Name: Area, Length: 268, dtype: float64


---
### Goal: Check how many rows match a condition using `.sum()`.

Summing a boolean Series yields the number of entries in the Series that are `True`. You can catch a lot of errors when using boolean indexing if you always count the number of selected rows when doing any DataFrame manipulation.
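Here is a small sketch of that counting check, using made-up temperatures. The count from `.sum()` should match the number of rows that the same boolean Series selects:

```python
import pandas as pd

temps = pd.Series([3.0, 8.5, 12.0, 4.5, 20.0])  # hypothetical values

is_cold = temps < 5      # boolean Series, one entry per element
n_cold = is_cold.sum()   # True counts as 1, False counts as 0

selected = temps[is_cold]

# Sanity check: the count matches the number of selected entries.
assert len(selected) == n_cold

print(n_cold)
```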

<span style="color:blue">**Exercise C3.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Count the fires with temperatures less than 6 degrees. Create a boolean Series where each entry records whether the fire with the given index has a temperature less than 6 degrees. Print the number of `True` values in this Series by using `.sum()`. This should be your output:

```
16
```

In [68]:
# Given code
df = pd.read_csv('forest-fires.csv')

# your code here
s = df.loc[:,"Temperature"] < 6 

print(s.sum())




16


---
### Goal: Select rows from a DataFrame based on values in the columns.

<span style="color:blue">**Exercise C4.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Use logical indexing to select the rows corresponding to fires with temperatures less than 6 degrees.

Part a: Create a boolean series where each entry corresponds to whether the fire with the given index has a temperature less than 6 degrees. Print this series.

```
0      False
1      False
2      False
3      False
4      False
       ...  
263    False
264    False
265    False
266    False
267    False
Name: Temperature, Length: 268, dtype: bool
```

Part b: Use `df.loc` indexing to create a new DataFrame that only contains the rows corresponding to these fires. Print the resulting smaller DataFrame:

```
     Temperature   Area
27           5.3   2.14
38           5.8   4.61
58           5.8  10.93
73           5.1  26.00
125          4.8   8.98
126          5.1  11.19
127          5.1   5.38
128          4.6  17.85
129          4.6  10.73
130          4.6  22.03
131          4.6   9.77
132          2.2   9.27
133          5.1  24.77
231          4.6   5.39
232          5.1   2.14
233          4.6   6.84
```

In [77]:
# Given code
df = pd.read_csv('forest-fires.csv')

# part a. your code here
s = df.loc[:,"Temperature"] < 6 
print(s)

# part b. your code here

print(df.loc[df.loc[:,"Temperature"] < 6 ,:])

0      False
1      False
2      False
3      False
4      False
       ...  
263    False
264    False
265    False
266    False
267    False
Name: Temperature, Length: 268, dtype: bool
     Temperature   Area
27           5.3   2.14
38           5.8   4.61
58           5.8  10.93
73           5.1  26.00
125          4.8   8.98
126          5.1  11.19
127          5.1   5.38
128          4.6  17.85
129          4.6  10.73
130          4.6  22.03
131          4.6   9.77
132          2.2   9.27
133          5.1  24.77
231          4.6   5.39
232          5.1   2.14
233          4.6   6.84


----
<span style="color:blue">**Exercise C5.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Use logical indexing to select the rows corresponding to fires with an area greater than 50.

Part a: Create a boolean series where each entry corresponds to whether the fire with the given index has an area greater than 50. Print this series.

```
0      False
1      False
2      False
3      False
4      False
       ...  
263    False
264    False
265    False
266     True
267    False
Name: Area, Length: 268, dtype: bool
```

Part b: Use `df.loc` indexing to create a new DataFrame that only contains the rows corresponding to these fires. Print the resulting smaller DataFrame:

```
     Temperature    Area
89          20.1   58.30
90          28.3   64.10
91          16.4   71.30
92          26.4   88.49
93          27.8   95.18
94          18.7  103.39
95          24.3  105.66
96          17.7  154.88
97          19.6  196.48
98          18.2  200.94
99          18.8  212.88
138         26.9   86.45
158         23.0   56.04
186         21.9  174.63
197         21.9   70.76
198         10.1   51.78
211         26.2  185.76
227         19.9   82.75
237         13.7   61.13
240         24.5   70.32
246         22.6  278.53
266         21.9   54.29
```

In [79]:
# Given code
df = pd.read_csv('forest-fires.csv')

# part a. your code here

s = df.loc[:,"Area"] > 50 
print(s)

# part b. your code here

print(df.loc[s,:])

0      False
1      False
2      False
3      False
4      False
       ...  
263    False
264    False
265    False
266     True
267    False
Name: Area, Length: 268, dtype: bool
     Temperature    Area
89          20.1   58.30
90          28.3   64.10
91          16.4   71.30
92          26.4   88.49
93          27.8   95.18
94          18.7  103.39
95          24.3  105.66
96          17.7  154.88
97          19.6  196.48
98          18.2  200.94
99          18.8  212.88
138         26.9   86.45
158         23.0   56.04
186         21.9  174.63
197         21.9   70.76
198         10.1   51.78
211         26.2  185.76
227         19.9   82.75
237         13.7   61.13
240         24.5   70.32
246         22.6  278.53
266         21.9   54.29


---
<span style="color:blue">**Exercise C6.**</span> Logical indexing with multiple conditions.

Using multiple conditions when indexing a DataFrame is very useful but syntactically confusing. We're going to have you do it in multiple steps with multiple variables first. Why multiple steps? Otherwise the single line of code looks like a magical formula. Once you get the individual steps, we'll have you combine the steps into one line of code that you can use later. A single line is also how you'll often see DataFrame indexing written. We want to show you how to recognize overall patterns that are getting combined for complex behavior.
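Before working on the forest fire data, it may help to see the step-by-step pattern on a tiny Series. This sketch uses invented values and variable names; it builds two boolean Series, combines them with `&`, and uses the result to select elements:

```python
import pandas as pd

areas = pd.Series([5, 12, 18, 25, 40])  # hypothetical values

at_least_10 = areas >= 10
at_most_20 = areas <= 20

# & combines the two boolean Series element by element.
both = at_least_10 & at_most_20

print(both)
print(areas[both])
```

Printing the intermediate Series at each step is a quick way to confirm that each condition does what you expect before combining them.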

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Use logical indexing to select the rows corresponding to fires with an area between 20 and 30 (inclusive).

Part a: Create a boolean series variable called `low_area_selector` where each entry corresponds to whether the fire with the given index has an area greater than or equal to 20. Create a second boolean series called `high_area_selector` where each entry corresponds to whether the fire with the given index has an area less than or equal to 30. Finally, create a third series called `selector` by taking the boolean *and* of the first two series (which can be done with the `&` operator). Print this final series.

```
0      False
1      False
2      False
3      False
4      False
       ...  
263    False
264    False
265    False
266    False
267    False
Name: Area, Length: 268, dtype: bool
```

Part b: Use `df.loc` indexing to create a new DataFrame that only contains the rows corresponding to these fires. Print the resulting smaller DataFrame:

```
     Temperature   Area
71          23.2  23.41
72          18.4  24.23
73           5.1  26.00
74          20.1  26.13
75          11.0  27.35
76          17.0  28.66
77          17.0  28.66
78          16.9  29.48
130          4.6  22.03
133          5.1  24.77
135          7.5  24.24
165         21.3  28.19
180         19.6  20.03
193         20.6  24.59
194         23.3  28.74
249         33.1  26.43
```

Part c: We've had you perform this indexing using multiple lines of code. Rewrite the previous part to use only one line of code. You will need to use parentheses around the boolean predicates; otherwise you will get an error, because `&` binds more tightly than comparison operators such as `>=` and `<=`. It's always good to be able to recognize error messages, so try this part without the parentheses first. Print the resulting DataFrame. The output should be the same as in part b:

```
     Temperature   Area
71          23.2  23.41
72          18.4  24.23
73           5.1  26.00
74          20.1  26.13
75          11.0  27.35
76          17.0  28.66
77          17.0  28.66
78          16.9  29.48
130          4.6  22.03
133          5.1  24.77
135          7.5  24.24
165         21.3  28.19
180         19.6  20.03
193         20.6  24.59
194         23.3  28.74
249         33.1  26.43
```
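The role of the parentheses can be checked on a small example. In this sketch (the `areas` values are made up), the parenthesized version selects the expected elements, while the unparenthesized version raises an exception because Python tries to evaluate `10 & areas` first. The exact exception type and message may depend on your pandas version, so the code catches both likely types:

```python
import pandas as pd

areas = pd.Series([5.0, 12.0, 18.0, 25.0, 40.0])  # hypothetical values

# With parentheses, each comparison is evaluated before the & operator.
ok = areas[(areas >= 10) & (areas <= 20)]
print(ok)

# Without parentheses, & is evaluated before >= and <=, which fails.
error_name = None
try:
    areas[areas >= 10 & areas <= 20]
except (TypeError, ValueError) as err:
    error_name = type(err).__name__
print(error_name)
```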

In [92]:
# Given code
df = pd.read_csv('forest-fires.csv')

# part a. your code here
low_area_selector = df.loc[:, "Area"] >= 20
high_area_selector = df.loc[:, "Area"] <= 30

selector =  high_area_selector & low_area_selector

print(selector) 

# part b. your code here

s = df.loc[selector,:]
print(s)

# part c. your code here


print(df.loc[(df.loc[:, "Area"] >= 20) & (df.loc[:, "Area"] <= 30), :])


0      False
1      False
2      False
3      False
4      False
       ...  
263    False
264    False
265    False
266    False
267    False
Name: Area, Length: 268, dtype: bool
     Temperature   Area
71          23.2  23.41
72          18.4  24.23
73           5.1  26.00
74          20.1  26.13
75          11.0  27.35
76          17.0  28.66
77          17.0  28.66
78          16.9  29.48
130          4.6  22.03
133          5.1  24.77
135          7.5  24.24
165         21.3  28.19
180         19.6  20.03
193         20.6  24.59
194         23.3  28.74
249         33.1  26.43
     Temperature   Area
71          23.2  23.41
72          18.4  24.23
73           5.1  26.00
74          20.1  26.13
75          11.0  27.35
76          17.0  28.66
77          17.0  28.66
78          16.9  29.48
130          4.6  22.03
133          5.1  24.77
135          7.5  24.24
165         21.3  28.19
180         19.6  20.03
193         20.6  24.59
194         23.3  28.74
249         33.1  26.43


----
<span style="color:blue">**Exercise C7.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Use logical indexing to select the rows corresponding to fires that have *both* a temperature less than 10 and an area less than 20.

Part a: Create a boolean series called `area_selector` where each entry corresponds to whether the fire with the given index has an area less than 20. Create a second boolean series called `temp_selector` where each entry corresponds to whether the fire with the given index has a temperature less than 10. Finally, create a third series called `selector` by taking the boolean *and* of the first two series. Print this final series.

```
0      False
1      False
2      False
3      False
4      False
       ...  
263    False
264    False
265    False
266    False
267    False
Length: 268, dtype: bool
```

Part b: Use `df.loc` indexing to create a new DataFrame that only contains the rows corresponding to these fires. Print the resulting smaller DataFrame:

```
     Temperature   Area
27           5.3   2.14
38           5.8   4.61
58           5.8  10.93
64           8.8  13.05
125          4.8   8.98
126          5.1  11.19
127          5.1   5.38
128          4.6  17.85
129          4.6  10.73
131          4.6   9.77
132          2.2   9.27
134          8.8   1.10
195          7.5   9.96
206          8.2   4.62
231          4.6   5.39
232          5.1   2.14
233          4.6   6.84
```

Part c: As in the previous exercise, combine all of the above into one line. Remember to surround each predicate with parentheses. Print the final DataFrame. It should be the same as in part b. 

```
     Temperature   Area
27           5.3   2.14
38           5.8   4.61
58           5.8  10.93
64           8.8  13.05
125          4.8   8.98
126          5.1  11.19
127          5.1   5.38
128          4.6  17.85
129          4.6  10.73
131          4.6   9.77
132          2.2   9.27
134          8.8   1.10
195          7.5   9.96
206          8.2   4.62
231          4.6   5.39
232          5.1   2.14
233          4.6   6.84
```

In [103]:
# Given code
df = pd.read_csv('forest-fires.csv')

# part a. your code here
area_selector = df.loc[:, "Area"] < 20
temp_selector = df.loc[:, "Temperature"] < 10
selector = area_selector & temp_selector

print(selector)

# part b. your code here

s = df.loc[selector,:]
print(s)


# part c. your code here

print(df.loc[(df.loc[:, "Area"] < 20) & (df.loc[:, "Temperature"] < 10), :])





0      False
1      False
2      False
3      False
4      False
       ...  
263    False
264    False
265    False
266    False
267    False
Length: 268, dtype: bool
     Temperature   Area
27           5.3   2.14
38           5.8   4.61
58           5.8  10.93
64           8.8  13.05
125          4.8   8.98
126          5.1  11.19
127          5.1   5.38
128          4.6  17.85
129          4.6  10.73
131          4.6   9.77
132          2.2   9.27
134          8.8   1.10
195          7.5   9.96
206          8.2   4.62
231          4.6   5.39
232          5.1   2.14
233          4.6   6.84
     Temperature   Area
27           5.3   2.14
38           5.8   4.61
58           5.8  10.93
64           8.8  13.05
125          4.8   8.98
126          5.1  11.19
127          5.1   5.38
128          4.6  17.85
129          4.6  10.73
131          4.6   9.77
132          2.2   9.27
134          8.8   1.10
195          7.5   9.96
206          8.2   4.62
231          4.6   5.39
232          5.1   2.14
233          4.6   6.84

----
### Goal: Summarize a Series with `.mean()`.

<span style="color:blue">**Exercise C8.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Print the mean temperature of all 268 fires. This should be your output:

```
19.258955223880594
```
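As a quick sketch of how `.mean()` and `.sum()` relate, the example below uses invented rainfall numbers. The mean of a Series is its sum divided by the number of entries, and calling `.mean()` on a whole DataFrame summarizes each column separately:

```python
import pandas as pd

rainfall = pd.Series([2.0, 4.0, 6.0, 8.0])  # hypothetical values

total = rainfall.sum()
average = rainfall.mean()

# The mean is the sum divided by the number of entries.
assert average == total / len(rainfall)

# On a DataFrame, .mean() returns one value per column.
table = pd.DataFrame({"a": [1.0, 3.0], "b": [10.0, 30.0]})
column_means = table.mean()

print(total, average)
print(column_means)
```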


In [113]:
# Given code
df = pd.read_csv('forest-fires.csv')

# your code here
# df.mean(axis=df.loc[:,"Temperature"])

print(df["Temperature"].mean()) 

19.25895522388061


----
<span style="color:blue">**Exercise C9.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Print the mean area of all 268 fires. This should be your output:

```
17.92884328358209
```


In [114]:
# Given code
df = pd.read_csv('forest-fires.csv')

# your code here
print(df["Area"].mean())

17.928843283582083


----
### Goal: Summarize a Series with `.sum()`.

<span style="color:blue">**Exercise C10.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Print the total area of all 268 fires. This should be your output:

```
4804.93
```


In [115]:
# Given code
df = pd.read_csv('forest-fires.csv')

# your code here
print(df["Area"].sum())


4804.93


---
## D. Modifying DataFrames

### Goal: Create a new DataFrame column as a function of an existing column.

<span style="color:blue">**Exercise D1.**</span> Create a new column using arithmetic on another column.

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: The `Temperature` column in the DataFrame is in degrees Celsius. Add a new column to the DataFrame named `Fahrenheit Temperature` that stores the corresponding Fahrenheit temperature for each row. The conversion formula is `F = C * 9/5 + 32`.  For example, the first row's temperature value of `18.0` should become `64.40` because `64.40 = 18.0 * 9/5 + 32`. Print the resulting DataFrame.

```
     Temperature   Area  Fahrenheit Temperature
0           18.0   0.36                   64.40
1           21.7   0.43                   71.06
2           21.9   0.47                   71.42
3           23.3   0.55                   73.94
4           21.2   0.61                   70.16
..           ...    ...                     ...
263         21.1   2.17                   69.98
264         18.2   0.43                   64.76
265         27.8   6.44                   82.04
266         21.9  54.29                   71.42
267         21.2  11.16                   70.16

[268 rows x 3 columns]
```
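One way to convince yourself that the conversion formula is applied row by row is to try it on temperatures whose Fahrenheit values you already know. This sketch uses a made-up `readings` DataFrame with the freezing point, the boiling point, and the temperature at which both scales agree:

```python
import pandas as pd

# Hypothetical readings with well-known Fahrenheit equivalents.
readings = pd.DataFrame({"Celsius": [0.0, 100.0, -40.0]})

# The arithmetic runs on every row; assigning to a new name adds a column.
readings["Fahrenheit"] = readings["Celsius"] * 9 / 5 + 32

print(readings)
```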


In [118]:
# Given code
df = pd.read_csv('forest-fires.csv')

# your code here

df["Fahrenheit Temperature"] = df.loc[:,"Temperature"] * 9/5 + 32

print(df)

     Temperature   Area  Fahrenheit Temperature
0           18.0   0.36                   64.40
1           21.7   0.43                   71.06
2           21.9   0.47                   71.42
3           23.3   0.55                   73.94
4           21.2   0.61                   70.16
..           ...    ...                     ...
263         21.1   2.17                   69.98
264         18.2   0.43                   64.76
265         27.8   6.44                   82.04
266         21.9  54.29                   71.42
267         21.2  11.16                   70.16

[268 rows x 3 columns]


----
<span style="color:blue">**Exercise D2.**</span> Create a column using `df.apply()` on an existing column.

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: Our ultimate goal is to add a column to the DataFrame that qualitatively records whether the temperature for each row is `hot` or `cold`. We're going to treat any Celsius temperature at or above 20 degrees as `hot`.

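Part a: Write a function named `rate_temperature` that takes in a number and returns the string `'hot'` if the number is greater than or equal to 20, and the string `'cold'` otherwise. Testing the function on the values 0, 10, 20, and 30 should print: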

```
cold
cold
hot
hot
```

Part b: Apply this function to each of the temperatures by calling `.apply()` on the `Temperature` column of the DataFrame. The call returns a Series, which you can assign to a new column named `Qualitative Temperature` in the DataFrame. Print the resulting DataFrame:

```
     Temperature   Area Qualitative Temperature
0           18.0   0.36                    cold
1           21.7   0.43                     hot
2           21.9   0.47                     hot
3           23.3   0.55                     hot
4           21.2   0.61                     hot
..           ...    ...                     ...
263         21.1   2.17                     hot
264         18.2   0.43                    cold
265         27.8   6.44                     hot
266         21.9  54.29                     hot
267         21.2  11.16                     hot

[268 rows x 3 columns]
```
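
As a rough sketch of the pattern Part b relies on, the example below calls `.apply()` on a single column (a Series) so that the function runs once per element. The `toy_df` DataFrame, the `Wind` column, the `rate_wind` function, and the threshold of 5 are all made up for illustration and are not part of the forest fire dataset.

```python
import pandas as pd

# Hypothetical toy data, not the forest-fires dataset
toy_df = pd.DataFrame({"Wind": [2.5, 7.0, 4.9, 5.0]})

def rate_wind(speed):
    """Return 'windy' for speeds of at least 5 (an arbitrary threshold), else 'calm'."""
    if speed >= 5:
        return "windy"
    return "calm"

# Series.apply calls rate_wind once per element and returns a new Series,
# which is then stored as a new column
toy_df["Qualitative Wind"] = toy_df["Wind"].apply(rate_wind)

print(toy_df)
```

Note that the function is passed by name, without parentheses, so that pandas can call it on each value.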

In [166]:
# Given code
df = pd.read_csv('forest-fires.csv')

# part a. your code here

def rate_temperature(a):
    # Temperatures at or above 20 degrees Celsius count as hot
    if a >= 20:
        return "hot"
    else:
        return "cold"


# test the function
for i in [0, 10, 20, 30]:
    print(rate_temperature(i))

# part b. your code here

df["Qualitative Temperature"] = df["Temperature"].apply(rate_temperature)

print(df)

cold
cold
hot
hot
     Temperature   Area Qualitative Temperature
0           18.0   0.36                    cold
1           21.7   0.43                     hot
2           21.9   0.47                     hot
3           23.3   0.55                     hot
4           21.2   0.61                     hot
..           ...    ...                     ...
263         21.1   2.17                     hot
264         18.2   0.43                    cold
265         27.8   6.44                     hot
266         21.9  54.29                     hot
267         21.2  11.16                     hot

[268 rows x 3 columns]


---
### Goal: Rename a column to have a more informative name.

The pandas DataFrame [`.rename()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) can be used to rename columns in the DataFrame. Instead of directly modifying the DataFrame, it returns a modified copy of the DataFrame. This copy must be stored in a variable; otherwise it is discarded and the original DataFrame keeps its old column names.
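
The difference between discarding and keeping the returned copy can be sketched on a small made-up DataFrame. The names `toy_df`, `renamed_df`, and the `Temp` column are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy_df = pd.DataFrame({"Temp": [18.0, 21.7], "Area": [0.36, 0.43]})

# The returned copy is not stored, so toy_df keeps its original column names
toy_df.rename(columns={"Temp": "Celsius Temp"})
print(list(toy_df.columns))

# Storing the returned copy keeps the renamed column
renamed_df = toy_df.rename(columns={"Temp": "Celsius Temp"})
print(list(renamed_df.columns))
```

The `columns` argument takes a dictionary that maps each old column name to its new name, so several columns can be renamed in one call.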

<span style="color:blue">**Exercise D3.**</span>

**Input**: A DataFrame of forest fire data:

```
df = 
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```
 
**Output**: 

Part a: Use the DataFrame's [`.rename()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html) to rename the 'Temperature' column in the DataFrame to have the more informative name 'Celsius Temperature'. To see the importance of saving the result of functions that do not operate in-place, *do not assign the result of this method to any variable.* Print the DataFrame. It should be identical to the input, with the original column names:

```
     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
```

Part b: Repeat the above step, this time storing the result of the call to `.rename()` in the `df` variable. Print the DataFrame:

```
     Celsius Temperature   Area
0                   18.0   0.36
1                   21.7   0.43
2                   21.9   0.47
3                   23.3   0.55
4                   21.2   0.61
..                   ...    ...
263                 21.1   2.17
264                 18.2   0.43
265                 27.8   6.44
266                 21.9  54.29
267                 21.2  11.16

[268 rows x 2 columns]
```


In [179]:
# Given code
df = pd.read_csv('forest-fires.csv')

# part a. your code here

df.rename(columns={'Temperature': 'Celsius Temperature'})
print(df)

# part b. your code here

df = df.rename(columns={'Temperature': 'Celsius Temperature'})
print(df)

     Temperature   Area
0           18.0   0.36
1           21.7   0.43
2           21.9   0.47
3           23.3   0.55
4           21.2   0.61
..           ...    ...
263         21.1   2.17
264         18.2   0.43
265         27.8   6.44
266         21.9  54.29
267         21.2  11.16

[268 rows x 2 columns]
     Celsius Temperature   Area
0                   18.0   0.36
1                   21.7   0.43
2                   21.9   0.47
3                   23.3   0.55
4                   21.2   0.61
..                   ...    ...
263                 21.1   2.17
264                 18.2   0.43
265                 27.8   6.44
266                 21.9  54.29
267                 21.2  11.16

[268 rows x 2 columns]


---
## E. Combining DataFrames

### Goal: Combine data from two DataFrames using `.merge()`.

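The pandas [`.merge()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) combines two DataFrames by matching rows on a shared column. Just like `.rename()`, it returns a new DataFrame instead of modifying either of the originals.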
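
As a minimal sketch of which rows survive a merge, the example below uses two made-up DataFrames (`left_df` and `right_df`, both hypothetical). By default `.merge()` keeps only the keys found in both DataFrames. The `how="left"` variant goes slightly beyond this exercise and is included only for contrast.

```python
import pandas as pd

# Hypothetical toy data: the keys "a" and "c" appear in both DataFrames
left_df = pd.DataFrame({"Key": ["a", "b", "c"], "X": [1, 2, 3]})
right_df = pd.DataFrame({"Key": ["a", "c", "d"], "Y": [10.0, 30.0, 40.0]})

# Default (inner) merge: only keys present in both DataFrames are kept
inner_df = left_df.merge(right_df, on="Key")
print(inner_df)

# Left merge: every row of left_df is kept; missing Y values become NaN
left_join_df = left_df.merge(right_df, on="Key", how="left")
print(left_join_df)

# Neither original DataFrame was modified
print(left_df)
```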

<span style="color:blue">**Exercise E1.**</span>

**Input**: Two DataFrames, one which contains a list of cities with population, and the other which contains the area of different cities.

```
population_df =
       City  Population
0    Ithaca       32108
1  New York     8804190
2    Boston      675647

area_df = 
           City    Area
0        Ithaca    6.07
1  Philadelphia  142.70
2      New York  472.43
```
 
**Output**:

Part a: Use the `population_df` DataFrame's [`.merge()` method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) to merge both DataFrames. To see the importance of saving the result of functions that do not operate in-place, *do not assign the result of this method to any variable.* Print the `population_df` DataFrame. It should be identical to the input:

```
       City  Population
0    Ithaca       32108
1  New York     8804190
2    Boston      675647
```

Part b: Repeat the above step, this time storing the result of the call to `.merge()` in the `population_df` variable. Print this DataFrame, which will contain the populations and areas for the cities that appear in both `population_df` and `area_df`. This should be your output:

```
       City  Population    Area
0    Ithaca       32108    6.07
1  New York     8804190  472.43
```


In [189]:
# Given code
population_df = pd.DataFrame({'City': ['Ithaca', 'New York', 'Boston'], 'Population': [32108, 8804190, 675647]})
area_df = pd.DataFrame({'City': ['Ithaca', 'Philadelphia', 'New York'], 'Area': [6.07, 142.70, 472.43]})

# part a. your code here

population_df.merge(area_df, on="City")
print(population_df)

# part b. your code here
population_df = population_df.merge(area_df, on="City")
print(population_df)


       City  Population
0    Ithaca       32108
1  New York     8804190
2    Boston      675647
       City  Population    Area
0    Ithaca       32108    6.07
1  New York     8804190  472.43


### Goal: Combine data from two DataFrames using `pd.concat()`.

The pandas [`pd.concat()` function](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) will vertically combine DataFrames. It creates a new DataFrame that must be stored in a variable, just like `.rename()` and `.merge()`. Special care needs to be taken with the indexes/labels of the rows in the combined DataFrame, as we'll see, because `pd.concat()` by default does not change the labels of rows.
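
To illustrate why repeated row labels need care, here is a small sketch with made-up DataFrames (`first_df` and `second_df`, both hypothetical). After a plain `pd.concat()`, the label `0` refers to two rows, so a label-based lookup with `.loc` returns both of them.

```python
import pandas as pd

# Hypothetical toy data; each DataFrame has the default labels 0 and 1
first_df = pd.DataFrame({"Letter": ["a", "b"]})
second_df = pd.DataFrame({"Letter": ["c", "d"]})

# Labels are carried over unchanged, so 0 and 1 each appear twice
stacked_df = pd.concat([first_df, second_df])
print(stacked_df.loc[0])  # selects two rows, one from each input

# ignore_index=True assigns fresh labels 0..3, so each label is unique
reindexed_df = pd.concat([first_df, second_df], ignore_index=True)
print(reindexed_df.loc[0])  # selects a single row
```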

<span style="color:blue">**Exercise E2.**</span>

**Input**: Two DataFrames, each of which contains a list of cities and their corresponding populations.
```
population_1_df =
       City  Population
0    Ithaca       32108
1  New York     8804190

population_2_df =
           City  Population
0        Boston      675647
1  Philadelphia     1603797

```
 
**Output**:

Part a: Our goal is to combine the two DataFrames into one DataFrame that contains all of the cities. Use the [`pd.concat()` function](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) to create this new combined DataFrame. Note that `pd.concat()` is not a DataFrame method, so you will need to pass in a list containing both DataFrames as an argument. Print the resulting DataFrame. Note the repeated indices/labels in the left-hand column.

```
           City  Population
0        Ithaca       32108
1      New York     8804190
0        Boston      675647
1  Philadelphia     1603797
```

Part b: We most often want labels to be unique integers in increasing order. Repeat the above step, this time passing in the argument `ignore_index=True` to `pd.concat()`. Print the combined DataFrame, and note that we get the indices we expect:

```
           City  Population
0        Ithaca       32108
1      New York     8804190
2        Boston      675647
3  Philadelphia     1603797
```

In [193]:
# Given code
population_1_df = pd.DataFrame({'City': ['Ithaca', 'New York'], 'Population': [32108, 8804190]})
population_2_df = pd.DataFrame({'City': ['Boston', 'Philadelphia'], 'Population': [675647, 1603797]})

# part a. your code here
print(pd.concat([population_1_df, population_2_df]))

# part b. your code here
print(pd.concat([population_1_df, population_2_df], ignore_index=True))

           City  Population
0        Ithaca       32108
1      New York     8804190
0        Boston      675647
1  Philadelphia     1603797
           City  Population
0        Ithaca       32108
1      New York     8804190
2        Boston      675647
3  Philadelphia     1603797
