# pandas2

Let's import pandas!

In [None]:
import pandas as pd

<br><br>We're going to work with a new DataFrame. This is a dataset of forest fires in northeastern Portugal. I have included the dataset as a csv file in today's materials, but the data is also publicly available at this site: https://archive.ics.uci.edu/ml/datasets/Forest+Fires

<br><br>Everyone can now run the next line of code to create a DataFrame from our csv file.

In [None]:
df = pd.read_csv("forestfires.csv")

In [None]:
df.head()

Take a look at the new dataframe to get to know our data.

How many rows and columns are in our data?

## <br><br><br>PART THREE: RENAMING COLUMNS, DROPPING ROWS AND COLUMNS, SORTING DATA, AND SAVING DATAFRAMES

### <br>Renaming columns

Here's what our column names look like in the forest fire dataset:

In [None]:
df.head()

Four of the columns end in "\_code". Let's remove that part from the column names, starting with two of them. We can use the `rename()` method. We need to pass the method a **dictionary** in which each key is an old column name and each value is the new name that replaces it.

In [None]:
df.rename(columns = {"moisture_code": "moisture", 
                     "fuel_code": "fuel"})

In [None]:
df.head()

Uh-oh, the change didn't stick. `rename()` returns a new DataFrame and leaves the original alone. We've encountered this before with strings, so we know the answer: assign the result back to `df`.

In [None]:
df = df.rename(columns = {"moisture_code": "moisture", 
                          "fuel_code": "fuel"})

In [None]:
df.head()
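
The same behaviour can be seen on a tiny DataFrame. This is only a sketch: `toy` and its column names and values are made up for illustration and are not part of the forest fire dataset.

```python
import pandas as pd

# Toy data with made-up column names and values (not the forest fire dataset)
toy = pd.DataFrame({"wind_code": [1, 2], "temp": [20.5, 18.0]})

# rename() returns a new DataFrame; the original is left unchanged
renamed = toy.rename(columns={"wind_code": "wind"})

print(list(toy.columns))      # the original still has the old name
print(list(renamed.columns))  # the returned DataFrame has the new name
```

Because the original is untouched, nothing changes in `df` until the result is assigned back to it.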

### <br><br>Exercise 1

Write code to remove "\_code" from the ends of the drought and initial_spread column names:

In [None]:
df.head()

### <br><br><br>Dropping rows and columns

Let's drop a single row from the DataFrame. How about row 2? As with `rename()`, you have to assign the result back to `df` to make the change permanent:

In [None]:
df = df.drop(2)

In [None]:
df.head()

<br>The index numbers did not reset when we dropped a row. 2 is missing!

We can reset the index and pretend like 2 was never there. The `reset_index()` method accepts several keyword arguments, but we only need one here: `drop`. If we don't pass `drop=True`, an extra column containing the old index numbers will be added to our DataFrame. Let's first reset the index without passing the argument, but we won't save that DataFrame:

In [None]:
df.reset_index()

You can see that the new column `index` contains the original index positions. Now let's save a new version of our DataFrame, with the index reset, but without that new column:

In [None]:
df = df.reset_index(drop=True)

In [None]:
df.head()
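
Here is the whole drop-and-reset sequence on a four-row toy DataFrame, so the effect on the index is easy to see. The data is made up for illustration.

```python
import pandas as pd

# Toy data (made up); the default index labels are 0, 1, 2, 3
toy = pd.DataFrame({"temp": [10, 20, 30, 40]})

# Drop the row labelled 2; the remaining labels are not renumbered
dropped = toy.drop(2)
print(list(dropped.index))

# Without drop=True, the old labels are kept in a new "index" column
with_old = dropped.reset_index()
print(list(with_old.columns))

# With drop=True, the old labels are discarded and rows are renumbered from 0
clean = dropped.reset_index(drop=True)
print(list(clean.index))
```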

<br><br><br>The `drop()` function defaults to dropping rows. If we want to drop a column, we need to add one more argument. `axis=1` is used in pandas to refer to columns as opposed to rows (`axis=0`). The `axis` argument is used elsewhere in pandas, too. Let's drop the "X" column:

In [None]:
df = df.drop("X", axis=1)

In [None]:
df.head()
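
If `axis=1` is hard to remember, `drop()` also accepts a `columns` keyword that does the same job. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data (made up) with a column we want to get rid of
toy = pd.DataFrame({"X": [1, 2], "Y": [3, 4], "temp": [20.5, 18.0]})

# Two equivalent ways to drop the "X" column
by_axis = toy.drop("X", axis=1)
by_keyword = toy.drop(columns="X")

print(by_axis.equals(by_keyword))  # both give the same DataFrame
print(list(by_keyword.columns))
```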

### <br><br>Exercise 2

Write code to view the last 5 rows of the DataFrame:

Now write code to drop the very last row:

In [None]:
df.tail()

Write code to remove the "Y" column:

In [None]:
df.head()

### <br><br><br>Sorting a DataFrame

There are two methods for sorting your DataFrame.

If you want to sort by the index numbers, or if you want to sort by the column names (alphabetically), you use `sort_index()`. We'll use two of its arguments: `axis`, which chooses whether to sort the rows or the columns, and `ascending`, which sets the order.

The default arguments are to sort by row index with 0 at the top, which is how we've already been viewing the data:

In [None]:
df.sort_index()

Let's try more arguments:

In [None]:
df.sort_index(ascending=False)

In [None]:
df.sort_index(axis=1)

In [None]:
df.sort_index(axis=1, ascending=False)

<br><br><br>The second sort function, `sort_values()`, will sort the frame by the data in a column:

In [None]:
df.sort_values("area_burned")

In [None]:
df.sort_values("day")

### <br><br>Exercise 3

Write code to sort the DataFrame by the rain column, with the largest values at the top:

<br><br><br>You can also sort on multiple values by passing the `sort_values` function a list of column names instead of a single name. If we want to first sort by day, then by area burned:

In [None]:
df.sort_values(["day", "area_burned"])
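
When sorting on several columns, `ascending` can also be a list with one value per column. The sketch below uses made-up data. Note that a text column such as `day` is sorted alphabetically, not in calendar order.

```python
import pandas as pd

# Toy data (made up) with the same column names as the workshop dataset
toy = pd.DataFrame({
    "day": ["sat", "fri", "sat", "fri"],
    "area_burned": [5.0, 1.0, 2.0, 9.0],
})

# Sort by day A-Z, and within each day put the largest area first
ordered = toy.sort_values(["day", "area_burned"], ascending=[True, False])
print(ordered)
```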

### <br><br><br>Saving your changed DataFrame

We've made a lot of changes to the forest fire dataset. Let's save it as a new csv file. First, we can decide what we're going to call the new file:

In [None]:
new_filename = "fire_changed.csv"

Next, we can use the `to_csv()` method function to save the new file:

In [None]:
df.to_csv(new_filename)
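
By default, `to_csv()` also writes the index as an unnamed first column. If you don't want that, pass `index=False`. The sketch below uses made-up data and calls `to_csv()` without a filename, which returns the csv text instead of writing a file, so we can look at the difference directly.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({"day": ["fri", "sat"], "rain": [0.0, 0.2]})

# With no filename, to_csv() returns the csv text as a string
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

print(with_index)     # header starts with an empty name for the index column
print(without_index)  # header contains only the real column names
```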

### <br><br>Exercise 4

Remember what you learned from Part Two: Selecting Data. Create a new dataframe called df_rain that contains only the columns day, month, and rain: 

Run the next line of code to save a new filename for the rain dataframe:

In [None]:
rain_filename = "forestFiresRain.csv"

Write code to save the df_rain dataframe as a csv file:

## <br><br><br>PART FOUR: DATA AGGREGATION

Data aggregation means taking many data points and reducing them to one number, whether it's a count, sum, mean, or other single statistic. Here are some DataFrame method functions:

- [`.count()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html)
- [`.sum()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html)
- [`.mean()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)
- [`.median()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)
- [`.min()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html)
- [`.max()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html)
- [`.unique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html)
- [`.nunique()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html)
- [`.std()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)   #Standard deviation
- [`.var()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)   #Variance
- And more! https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats

If you use a method function on the entire dataset, it will try its best to execute the method for all columns.

In [None]:
df.count()

In [None]:
df.min()

In [None]:
df.sum()

In [None]:
df.unique()

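<br>Not all functions will work on the entire DataFrame. `unique()` only exists for a single column (a Series), so `df.unique()` raises an `AttributeError`. Most of the time you are interested in only a subset of the data anyway: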

In [None]:
df["day"].unique()

In [None]:
list(df["day"].unique())

In [None]:
df["month"].nunique()

In [None]:
df["temp"].var()
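
To make the difference between a whole DataFrame and a single column concrete, here is a small sketch on made-up data. It checks which object has a `unique()` method rather than triggering the error.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({"day": ["fri", "sat", "fri"], "temp": [20.0, 18.0, 22.0]})

print(hasattr(toy, "unique"))         # the DataFrame has no unique() method
print(hasattr(toy["day"], "unique"))  # a single column (Series) does

days = list(toy["day"].unique())  # distinct values, in order of appearance
n_days = toy["day"].nunique()     # how many distinct values there are
per_column = toy.nunique()        # nunique() also works on the whole DataFrame
print(days, n_days)
print(per_column)
```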

### <br><br>Exercise 1

Write code to find the mean humidity of the dataset:

Write code to find the coldest temperature in the dataset:

## <br><br>groupby

Often, you will want to calculate the statistics for a particular subgroup of a data column.

For example, let's say we want to ask if more fires happen on certain days of the week. This code will tell you the count for every column in the DataFrame except the column that you are using to group your data (i.e. "day").

In [None]:
df.groupby("day").count()

<br>We can make this easier to read by sorting the rows from lowest count to highest count. We can string multiple method functions onto our code, as long as they go in the correct logical order. We will pick any column to sort on, since they are all the same.

In [None]:
df.groupby("day").count().sort_values("month")
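
The counts are the same in every column only when no values are missing, because `count()` skips missing values. The sketch below, on made-up data with one missing `rain` value, shows the difference and uses `size()` to count rows per group regardless of missing values.

```python
import pandas as pd

# Toy data (made up); one rain value is missing
toy = pd.DataFrame({
    "day": ["fri", "fri", "sat"],
    "rain": [0.0, None, 0.2],
    "temp": [20.0, 18.0, 22.0],
})

counts = toy.groupby("day").count()  # non-missing values per column
sizes = toy.groupby("day").size()    # number of rows per group

print(counts)
print(sizes)
```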

<br>It looks like weekends have more fires than weekdays.

### <br><br>Exercise 2

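Write code to find the mean for all the columns except `day` when grouped by month. (In pandas 2.0 and later, `mean()` raises an error on text columns such as `day`, so pass `numeric_only=True` to skip them.)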

<br><br><br><br>If you only want to see an aggregate total for one column in the DataFrame, you can add on the indexing technique we learned yesterday. With this code I will ask, What is the mean area burned on each day of the week?

In [None]:
df.groupby("day")[["area_burned"]].mean()

<br>Again, we can also sort the data:

In [None]:
df.groupby("day")[["area_burned"]].mean().sort_values("area_burned")

<br>So, on average, Saturday fires burn the largest area.

<br>We can also add some other functions to our code, like round:

In [None]:
df.groupby("day")[["area_burned"]].mean().round(2).sort_values("area_burned")
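
Long chains like this one can be wrapped in parentheses and split over several lines, one step per line. The sketch below uses made-up data and also shows what the double brackets are doing: `[["area_burned"]]` keeps the result as a DataFrame, while single brackets give a Series.

```python
import pandas as pd

# Toy data (made up) with the same column names as the workshop dataset
toy = pd.DataFrame({
    "day": ["fri", "sat", "fri", "sat", "fri"],
    "area_burned": [1.0, 4.0, 2.0, 0.5, 2.0],
})

# Each step works on the result of the step before it
result = (
    toy.groupby("day")[["area_burned"]]  # group rows by day, keep one column
    .mean()                              # one mean per day
    .round(2)                            # round to 2 decimal places
    .sort_values("area_burned")          # smallest mean first
)
print(result)

# Single brackets return a Series instead of a DataFrame
as_series = toy.groupby("day")["area_burned"].mean()
print(type(as_series))
```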

### <br><br>Exercise 3

Write code to count how many fires happened in each month (as a bonus, sort the results):

Write code to see the mean area burned for fires in each month (as a bonus, round and sort the results):

That's the end of the pandas workshop. There's additional material available in the `part4-pandasBonus.ipynb` notebook, which covers visualization, working with missing data, 