# Data Wrangling

Data wrangling is one of the hardest parts of working with data. Raw data is rarely ready to be analyzed, so scientists spend a lot of time manipulating it into a form they can use. We call this work "wrangling".

<img src = 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/52/Cats_in_aoshima_island_1.JPG/1600px-Cats_in_aoshima_island_1.JPG' width = 400>

Possible problems with data:
* Getting only the data you need 
* Incorrect data types
* Incorrect units

## Selecting Columns
The most common way to get data from a DataFrame is by selecting one or more columns. Remember, each column is a Pandas Series. Columns are much easier to work with because they are arrays, meaning all the elements in a column are the same datatype (or they should be).

We'll continue using the historical global temperatures dataset from [Berkeley Earth](https://berkeleyearth.org/about/). Since this is a new notebook, we need to import Pandas again and load the dataset.

In [None]:
import pandas as pd
global_temps = pd.read_csv('data/global_temps_by_city.csv')
global_temps.head()

### Selecting a Single Column
To select a single column, use square bracket notation with the column name as a string:

In [None]:
cities = global_temps['City']
print(cities)

### Selecting Multiple Columns
To select multiple columns, you need to pass a **list** of column names inside the square brackets. This will return a new dataframe.

The double brackets may seem odd, but they appear because we are providing the desired columns as a list, which is itself written in brackets:

`list_of_columns = ['column_name1', 'column_name2', ...]` <br>
+<br>
`dataset_name[list_of_columns]`<br>
=<br>
`dataset_name[['column_name1', 'column_name2', ...]]`

In [None]:
city_temps = global_temps[['AverageTemperature', 'City', 'Country']]
city_temps
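
The difference between single and double brackets can be sketched on a small made-up DataFrame. The `toy` data below is hypothetical and only stands in for the real dataset; the point is the type that each selection returns.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'AverageTemperature': [25.1, 30.4, 18.9],
    'City': ['Lagos', 'Delhi', 'Lima'],
    'Country': ['Nigeria', 'India', 'Peru'],
})

single = toy['City']        # one column name -> Series
double = toy[['City']]      # list holding one name -> DataFrame with one column

cols = ['City', 'Country']  # the list can be built first...
subset = toy[cols]          # ...and passed in, same as toy[['City', 'Country']]

print(type(single).__name__)
print(type(double).__name__)
print(subset.columns.tolist())
```

Even with a single column name, the list form gives back a DataFrame, which is handy when later code expects DataFrame methods.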

## Selecting Rows

Rows are a little trickier to reference than columns. A column can be selected by its name, and all of its elements share one data type. A row cuts across several columns, so it can contain multiple data types and requires a different way of thinking. We can select rows two ways: by label or by position.

- `loc` (label-based indexing): Used for selecting by the actual labels of your index (which are often numbers, but can be text or dates).

- `iloc` (integer-location based indexing): Used for selecting by the numerical position of the rows and columns (always 0-indexed).

### Selecting Rows with `loc`

`loc` works with the index labels. If your DataFrame has a default integer index (0, 1, 2...), `loc` can use these integers. You can select by:
- single rows
- multiple rows
- range of rows

In [None]:
# Single row
first_row = city_temps.loc[0]
first_row

In [None]:
# Multiple rows
mult_row = city_temps.loc[[2, 343, 10555]]
mult_row

In [None]:
# Range of rows (with loc, the end label 508 is included)
range_row = city_temps.loc[502:508]
range_row

### Selecting Rows with `iloc`

`iloc` works with the position of the row/column, always starting from 0. You can select by:
- single rows
- multiple rows
- range of rows

In [None]:
# Single row
first_row = city_temps.iloc[0]
first_row

In [None]:
# Multiple rows
mult_row = city_temps.iloc[[2, 343, 10555]]
mult_row

In [None]:
# Range of rows (with iloc, the end position 508 is excluded)
range_row = city_temps.iloc[502:508]
range_row

#### `loc` vs. `iloc`

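It may appear that `loc` and `iloc` behave almost the same way, but that is only because the index in the `city_temps` dataframe uses the default integer structure, so each row's label matches its position. Even here there is one visible difference: `loc[502:508]` includes row 508 (7 rows), while `iloc[502:508]` stops before it (6 rows). Not all datasets have a default index, and we can also set the index to a different column if we want (which we will learn later!).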
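
To see the two diverge, here is a minimal sketch using a made-up DataFrame whose index labels are deliberately out of order. The `toy` data and its index values are assumptions for illustration, not part of the temperature dataset.

```python
import pandas as pd

# Hypothetical data: the index labels (2, 0, 1) do not match the row positions
toy = pd.DataFrame({'City': ['Lagos', 'Delhi', 'Lima']}, index=[2, 0, 1])

by_label = toy.loc[0, 'City']    # row whose LABEL is 0 -> second row
by_position = toy.iloc[0, 0]     # row at POSITION 0 -> first row

print(by_label)
print(by_position)
```

The same number, `0`, picks a different row depending on whether it is treated as a label or as a position.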

#### Cross Fire

To select rows and columns at the same time, use `df.loc[row_labels, column_labels]` or `df.iloc[row_positions, column_positions]`.

In [None]:
specific_data = city_temps.loc[[323], ['City']]
print(specific_data)

In [None]:
specific_data = city_temps.iloc[[323], [1]]
print(specific_data)

In [None]:
# Select multiple cells
multiple_data = city_temps.loc[[323, 765], ['AverageTemperature', 'City']]
multiple_data
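
The brackets around the row and column selectors also affect what comes back. The sketch below uses a made-up `toy` DataFrame (an assumption for illustration) to compare list selectors with bare ones.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'AverageTemperature': [25.1, 30.4, 18.9],
    'City': ['Lagos', 'Delhi', 'Lima'],
    'Country': ['Nigeria', 'India', 'Peru'],
})

as_frame = toy.loc[[1], ['City']]            # lists on both sides -> 1x1 DataFrame
as_value = toy.loc[1, 'City']                # bare label and name -> the cell's value
as_series = toy.loc[1, ['City', 'Country']]  # bare row label, list of columns -> Series

print(type(as_frame).__name__, as_frame.shape)
print(as_value)
print(type(as_series).__name__)
```

Use the bare form when you want the value itself, and the list form when you want to keep working with a DataFrame.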

#### Slicing
You can "slice" a range of rows or columns by their labels. Important: Slicing with `loc` includes both the start and the end label, unlike `iloc` and ordinary Python slices, which stop just before the end position.

In [None]:
sliced_data = city_temps.loc[5000:5005, 'AverageTemperature':'Country']
sliced_data
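
A quick way to check the inclusive and exclusive behavior is to compare the shapes of two slices. This is a sketch on a made-up `toy` DataFrame with a default integer index, not the real dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'AverageTemperature': [25.1, 30.4, 18.9, 22.0, 27.3],
    'City': ['Lagos', 'Delhi', 'Lima', 'Quito', 'Cairo'],
    'Country': ['Nigeria', 'India', 'Peru', 'Ecuador', 'Egypt'],
})

# loc: labels 1, 2 and 3; columns from 'AverageTemperature' through 'City'
loc_slice = toy.loc[1:3, 'AverageTemperature':'City']

# iloc: positions 1 and 2 only; column positions 0 and 1
iloc_slice = toy.iloc[1:3, 0:2]

print(loc_slice.shape)
print(iloc_slice.shape)
```

Both slices are written `1:3`, yet the `loc` version returns one more row because it keeps the end label.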

___
## 💪💪 **Pair Programming Exercise** 💪💪

1. Create a new dataframe called `temps_lat_long` that contains only the `AverageTemperature`, `City`, `Latitude`, and `Longitude` columns. Display the head of this new dataframe to check your work.
2. Select and display the data for the 10000th row.
___

## Conditional Filtering (Boolean Indexing)

This is one of the most powerful features of Pandas: selecting rows based on whether they meet a certain condition. You pass a Series of `True`/`False` values (a "Boolean Series") inside the square brackets. Only rows where the value is `True` will be returned. The easiest way to understand this is by breaking it down, step by step:

**Step 1: Create a condition**

Let's ask a question: "Which cities have experienced a very hot average temperature in a given month?" Let's say 35°C (95°F) is "very hot".

In [None]:
condition = city_temps['AverageTemperature'] >= 35

**Step 2: Generate the Boolean Series**

If you `print(condition)`, you won't see the cities. You will see the Boolean Series, with a `True`/`False` answer for every row:

In [None]:
print(condition)

**Step 3: Use the Series to filter the dataframe**

Now, you use that condition (the Boolean Series) as a filter inside the square brackets of your original dataframe.

Pandas will look at your `city_temps` dataframe, and for every row, it will check the corresponding value in your condition Series. It will only keep the rows where the value is `True`.

In [None]:
high_temps = city_temps[condition]
high_temps

You will often see these steps condensed into a single line of code:

In [None]:
high_temps = city_temps[city_temps['AverageTemperature'] >= 35]
high_temps
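
Two details of Boolean Series are worth a quick sketch: `True` values can be counted with `.sum()`, and a comparison against a missing value (`NaN`) evaluates to `False`, so those rows are dropped by the filter. The `toy` data below is made up for illustration, including the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing temperature
toy = pd.DataFrame({
    'AverageTemperature': [36.2, np.nan, 18.9, 35.0],
    'City': ['Kuwait City', 'Delhi', 'Lima', 'Baghdad'],
})

mask = toy['AverageTemperature'] >= 35   # NaN >= 35 evaluates to False
n_hot = mask.sum()                       # True counts as 1, False as 0
hot = toy[mask]                          # keeps only the rows where mask is True

print(mask.tolist())
print(n_hot)
print(hot['City'].tolist())
```

Counting the `True` values first is a cheap way to check how many rows a filter will keep before applying it.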

### Combining Conditions

You can combine multiple conditions using logical operators:

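- `&` for AND (both conditions must be true)
- `|` for OR (at least one condition must be true)

Remember to put each condition in parentheses! `&` and `|` bind more tightly than comparisons like `==` and `>`, so without parentheses Python would group the expression incorrectly.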

In [None]:
condition = (city_temps['Country'] == 'India') & (city_temps['AverageTemperature'] > 36)
hot_india = city_temps[condition]
hot_india
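
For comparison, here is a sketch of AND next to OR on a small made-up DataFrame, plus the `~` operator for NOT. The `~` operator is not covered above; it is included as an extra, and the `toy` data is hypothetical.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'AverageTemperature': [37.0, 30.0, 38.5, 19.0],
    'City': ['Delhi', 'Mumbai', 'Kuwait City', 'Lima'],
    'Country': ['India', 'India', 'Kuwait', 'Peru'],
})

in_india = toy['Country'] == 'India'
very_hot = toy['AverageTemperature'] > 36

both = toy[in_india & very_hot]     # AND: rows in India that are also above 36
either = toy[in_india | very_hot]   # OR: rows in India, or above 36, or both
not_india = toy[~in_india]          # NOT: flips every True/False in the Series

print(both['City'].tolist())
print(either['City'].tolist())
print(not_india['City'].tolist())
```

Storing each condition in its own variable, as done here, is one way to avoid the parentheses problem and keep long filters readable.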

___
## 💪💪 **Pair Programming Exercise** 💪💪

Using the original `global_temps` dataframe:

1. Find all the records in the `global_temps` dataframe where the Country is 'United States'. Create a new dataframe called `us_temps` to store this data. Display the number of records in this new dataframe.
2. Now, let's find the coldest months on record in Chicago. Filter the `us_temps` dataframe to find all records for the City of 'Chicago' where the `AverageTemperature` was below -15°C.
___

## 📓 Reflection 📓

How was the experience of "pair programming"? What did you like/not like about it? Put yourself in your students' shoes. How do you think they would react to pair programming? 