# Lesson 3: Text Data and Messy/Missing Data

To jump to the recap, click [here](#recap)

# Initial Setup

Import libraries and initialize variables to pick up where we left off in Lesson 2.

In [None]:
import pandas as pd

%matplotlib inline

In [None]:
weather_yvr = pd.read_csv('data/weather_yvr.csv')
weather_yvr['Relative Humidity (fraction)'] = weather_yvr['Relative Humidity (%)'] / 100
weather_yvr['Temperature (F)'] = 1.8 * weather_yvr['Temperature (C)'] + 32

# Text Data

What about text data like the `'Conditions'` column?

In [None]:
conditions = weather_yvr['Conditions']

In [None]:
conditions

The Series `conditions` has a `dtype` of `object`.

There are many repeated values in this Series.
- How many unique values are in the Series?
- How often does each value occur?
- What are the most common values?

We can use the `unique` method to list the unique values:

In [None]:
conditions.unique()

We can use the function `len` to find the number of unique values:

In [None]:
len(conditions.unique())

Alternatively, we could use the `nunique` method to find the number of unique values:

In [None]:
conditions.nunique()

`value_counts` is a very handy method to quickly summarize a Series of text data and find the most common values:

In [None]:
conditions.value_counts()

We can assign the output of `value_counts` to a variable:

In [None]:
conditions_counts = conditions.value_counts()

In [None]:
conditions_counts

What data type do you think `conditions_counts` is?

In [None]:
type(conditions_counts)

In [None]:
conditions_counts.index

- `conditions_counts` is a Series, with `dtype` of `int64`
- It has text labels as its index
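
As a quick way to see these pieces side by side, here is a minimal sketch on a small made-up Series (the values below are invented for illustration and are not taken from `weather_yvr`):

```python
import pandas as pd

# Made-up sample values, standing in for the 'Conditions' column
sample = pd.Series(['Cloudy', 'Rain', 'Cloudy', 'Clear', 'Rain', 'Cloudy'])

print(sample.unique())    # distinct values, in order of first appearance
print(sample.nunique())   # number of distinct values

sample_counts = sample.value_counts()  # counts, most common first
print(type(sample_counts))             # a pandas Series
print(list(sample_counts.index))       # the index holds the text labels
print(sample_counts['Cloudy'])         # look up a count by its label
```

Because the index holds the text labels, a count can be looked up by name, just like a column in a DataFrame.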

We can plot `conditions_counts` as a vertical or horizontal bar chart. Here is a horizontal bar chart:

In [None]:
conditions_counts.plot(kind='barh');

# Messy Data

`pandas` makes it very easy to spot inconsistencies and missing values in your data.

In [None]:
messy = pd.read_csv('data/weather_yvr_messy.csv')
messy.head()

Let's check out `value_counts` for the `'Conditions'` column:

In [None]:
conditions_m = messy['Conditions']
conditions_m.value_counts()

- We can see inconsistencies in capitalization and white space in these values
- Categories that should be the same (e.g. 'Mainly Sunny' and 'Mainly sunny') are counted as separate categories

The `unique` method can give some additional insights:

In [None]:
conditions_m.unique()

- Each distinct value is listed once, in the order it first appears in the Series, so near-duplicates are easy to compare
- Extra leading / trailing white spaces are clearly visible
- Missing values appear as `nan`

We can see how many missing values there are by passing the keyword argument `dropna=False` to `value_counts`:

In [None]:
conditions_m.value_counts(dropna=False)

We can apply the string methods we saw earlier to `conditions_m` to quickly and easily standardize the Series:
- Convert all values to lower case
- Strip extra leading and trailing white space from all values

In [None]:
conditions_lower = conditions_m.str.lower()
conditions_lower

In [None]:
conditions_clean = conditions_lower.str.strip()
conditions_clean

The previous two steps could be consolidated into a single line of code, using method chaining:

In [None]:
conditions_clean = conditions_m.str.lower().str.strip()
conditions_clean

In [None]:
conditions_clean.value_counts(dropna=False)
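
To see the effect of the clean-up in isolation, here is a small sketch on made-up values (not the actual contents of `weather_yvr_messy.csv`), comparing the counts before and after standardizing:

```python
import numpy as np
import pandas as pd

# Made-up messy values: mixed capitalization, stray spaces, one missing value
raw = pd.Series(['Mainly Sunny', 'Mainly sunny', ' Cloudy', 'cloudy ', np.nan])

# Lower-case everything, then strip leading/trailing white space
clean = raw.str.lower().str.strip()

print(raw.value_counts(dropna=False))    # five separate entries
print(clean.value_counts(dropna=False))  # two categories plus the missing value
```

Note that the missing value stays missing: the string methods skip it rather than raising an error.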

We can add this standardized version of the `'Conditions'` column to our DataFrame and save to CSV:

In [None]:
messy['Conditions (standardized)'] = conditions_clean
messy.head()

In [None]:
messy.to_csv('data/weather_yvr_cleaned.csv', index=False)
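
If you are curious what `index=False` changes, the sketch below uses a tiny made-up DataFrame and asks `to_csv` for the CSV text directly (when no file path is given, it returns a string), so nothing is written to disk:

```python
import io
import pandas as pd

# Made-up two-row DataFrame, only for illustration
df = pd.DataFrame({'Conditions': ['Rain', 'Cloudy'], 'Temperature (C)': [7.5, 9.0]})

# With no path given, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an unnamed index column
print(without_index.splitlines()[0])  # header has only the two data columns

# Reading the text back gives the same columns we started with
round_trip = pd.read_csv(io.StringIO(without_index))
print(list(round_trip.columns))
```

Leaving out the index avoids an extra unnamed column appearing when the file is read back in later.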

# Notebooks vs. Scripts

So far our workflow has been exploratory and interactive:

![](img/workflow0.png)

Fernando Perez (creator of IPython and Jupyter) calls this "humans in the loop"

- Write a bit of code
- Run the code
- Look at the output and see what's interesting, what needs to be done next, new questions to ask
- Write a bit more code
- and so on...

Sometimes, we might want to develop a more automated workflow for tasks we need to do over and over.

![](img/workflow1.png)

- Suppose we have a bunch of CSV files with messy weather data similar to the previous example
- We might want to repeat the above steps to process each file and save the standardized data to new files
- We could adapt the code from our notebook into a Python **script**

*See the extra section "Automating Tasks with Scripts" in `Lesson 3 - Text Data and Messy Data.ipynb` to learn how to create and run a script in Jupyter Lab*

# Automating Tasks with Scripts

### Writing a Script

- In Jupyter Lab, make sure you're in the main folder with your workshop files
- From the Launcher, click the "Text Editor" icon near the bottom
- A new text file is created&mdash;rename it from "untitled.txt" to "my_script.py"
- Copy the relevant lines of code from the messy data example in our notebook into "my_script.py". You'll want to include the following steps in your script:
  - Import `pandas` library
  - Read `'data/weather_yvr_messy.csv'` into a DataFrame
  - Apply the `strip` and `lower` string methods to the `'Conditions'` column of the DataFrame
  - You can add the cleaned data as a new column (e.g. `'Conditions (standardized)'`) or simply over-write the `'Conditions'` column with the cleaned data
  - Save the cleaned data to a new CSV file
- Press Ctrl-S (or Cmd-S on Mac) to make sure "my_script.py" is saved
- To see an example of what this would look like, check out "example_script.py"
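
The steps above could be sketched roughly as follows. This is not the contents of "example_script.py"; the function name `standardize_conditions` and the demo file names are hypothetical, and the demo writes a tiny made-up CSV to a temporary folder so that the sketch runs on its own:

```python
import tempfile
from pathlib import Path

import pandas as pd


def standardize_conditions(in_path, out_path):
    """Read a messy CSV, standardize 'Conditions', and save a new CSV."""
    df = pd.read_csv(in_path)
    df['Conditions (standardized)'] = df['Conditions'].str.lower().str.strip()
    df.to_csv(out_path, index=False)
    print(f'Saved {len(df)} rows to {out_path}')
    return df


# Demo on a tiny made-up file in a temporary folder, so the sketch runs anywhere.
# In the workshop you would pass 'data/weather_yvr_messy.csv' instead.
with tempfile.TemporaryDirectory() as tmp:
    in_path = Path(tmp) / 'messy_demo.csv'
    out_path = Path(tmp) / 'cleaned_demo.csv'
    in_path.write_text('Conditions\nMainly Sunny\n cloudy \nMainly sunny\n')
    cleaned = standardize_conditions(in_path, out_path)
    saved = pd.read_csv(out_path)
```

Wrapping the steps in a function is one possible design: the same function could then be called in a loop over several messy files.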

### Running a Script

- From the Launcher in Jupyter Lab, click the "Terminal" icon near the bottom
- To the left of the command prompt, it will show what folder you're in
  - This might not be the workshop folder&mdash;you might be in your main user folder
- To navigate to your workshop folder, use the command `cd` followed by the relative path of the folder:
  - For example, on my computer, the terminal opens in the folder `C:\Users\jenfly`. From here, I use the command:
```
cd Projects\pydata-intro-workshop
```
  - This will change my working folder to be `C:\Users\jenfly\Projects\pydata-intro-workshop`, which will now appear to the left of the prompt
  - On your computer, you'll want to substitute the appropriate folder names, and if you're on a Mac or Linux, use forward slashes `/` instead of back slashes `\`
- Now that you're in the correct folder, run the following command at the prompt to run your script:
```
python my_script.py
```
- If everything worked, the script will execute with no error messages and the new CSV file with the cleaned data will have been created!
  - Incorporating `print` statements into your script can help you verify that it's running properly. For an example of this, try running the sample script:
```
python example_script.py
```

# Missing Data

- We saw from `conditions_m.value_counts(dropna=False)` that there are 2 missing values in `conditions_m`
- With any data that we're working with, it's good to know:
  - How many values are missing?
  - Where are the empty cells located in our DataFrame (or Series)?

We can use the `isnull` method to locate missing values

In [None]:
missing_conditions = conditions_m.isnull()
missing_conditions

- `missing_conditions` is a Series of Booleans, with `True` where the value in `conditions_m` is missing and `False` where it is not missing

- We can count the missing values using the `sum` method:
  - Adds up all the values in the Series, treating `True` as 1 and `False` as 0

In [None]:
missing_conditions.sum()

The `isnull` method can be applied to an entire DataFrame:

In [None]:
missings = messy.isnull()
missings

- We can find the number of missing values in each column of the DataFrame with the `sum` method:
  - Computes the sum along each column

In [None]:
missings.sum()
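
Here is the same idea as a self-contained sketch on a small made-up DataFrame (the values are invented for illustration), showing the per-column counts and the overall total:

```python
import numpy as np
import pandas as pd

# Made-up DataFrame with a few gaps
df = pd.DataFrame({
    'Conditions': ['Rain', np.nan, 'Cloudy', np.nan],
    'Temperature (C)': [7.5, 8.0, np.nan, 9.5],
})

is_missing = df.isnull()        # DataFrame of Booleans, same shape as df
per_column = is_missing.sum()   # one count per column
print(per_column)
print(per_column.sum())         # total number of missing values in the table
```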

If you need to fill your missing data, there are many tools that can be used, such as the `pandas` methods `fillna` and `interpolate`.
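
As a rough illustration of these two methods, the sketch below uses made-up values. Whether filling is appropriate at all, and with what, depends on your data; the placeholder label `'unknown'` is just an assumption for this example:

```python
import numpy as np
import pandas as pd

# Text data: replace missing values with a placeholder label
conditions = pd.Series(['Rain', np.nan, 'Cloudy'])
conditions_filled = conditions.fillna('unknown')
print(conditions_filled)

# Numeric data: estimate gaps from the neighbouring values
# (interpolate uses linear interpolation by default)
temps = pd.Series([10.0, np.nan, 14.0, np.nan, 20.0])
temps_filled = temps.interpolate()
print(temps_filled)
```

Both methods return a new Series and leave the original unchanged.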

<a id="recap"></a>
# Recap 3

### Counting Unique Values

Unique values in a Series: 
```
series.unique()
```

Number of unique values in a Series:
```
series.nunique()
```
or you could use `len(series.unique())`


Counts of each unique value in a Series:
- Excluding missing values:
```
series.value_counts()
```
- Including missing values:
```
series.value_counts(dropna=False)
```

### Bar Charts

Plot a horizontal bar chart of a Series: 
```
series.plot(kind='barh')
```
For a vertical bar chart, use `kind='bar'`.

### Text Processing

Apply string methods to a text Series&mdash;use string methods in `series.str`:
```
series_lower = series.str.lower()
```
Apply multiple methods with method chaining:
```
series_lower_stripped = series.str.lower().str.strip()
```
  
  
### Missing Data

Locate missing values in a Series or DataFrame:
```
data.isnull()
```

Calculate the total number of missing values in a Series, or in each column of a DataFrame: 
```
data.isnull().sum()
```

# Exercise 3

a) Familiarize yourself with the file `'data/weather_airport_stations.csv'` in the Jupyter Lab CSV viewer, and then read it into a new variable `weather_all`
- Display a random sampling of 10 rows
- How many rows and columns does the data have?
- What are the lowest and highest temperatures in the data?

b) How many unique Station Names and Datetimes are in the data? List the unique values.

c) What are the three most common Conditions and the three most common Wind Directions?

d) Which column has the most missing values? How many are missing in this column?

#### Bonus exercises

e) How many temperatures in `weather_all` are less than 0 and how many are greater than 20?
- *Hint: review the comparison operators in the "Booleans" section of Lesson 0*

f) Work through the steps in the "Automating Tasks with Scripts" section in Lesson 3 to create and run a script in Jupyter Lab

a) Read the file `'data/weather_airport_stations.csv'` into a new variable `weather_all`.

- A random sampling of 10 rows of `weather_all`:

In [None]:
weather_all = pd.read_csv('data/weather_airport_stations.csv')
weather_all.sample(10)

- Number of rows and columns:

In [None]:
weather_all.shape

The data has 480 rows and 17 columns.

- Lowest and highest temperature:

In [None]:
weather_all['Temperature (C)'].describe()

The lowest temperature is -7.1 C and the highest is 28.6 C.
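
If you only need the two extremes, the `min` and `max` methods are an alternative to reading them off the `describe` output. A minimal sketch on made-up temperatures (not the values in `weather_all`):

```python
import pandas as pd

temps = pd.Series([-2.5, 3.4, 21.0, 15.0])  # made-up values

print(temps.min())   # lowest value
print(temps.max())   # highest value

# describe() reports the same numbers under the 'min' and 'max' labels
summary = temps.describe()
print(summary['min'], summary['max'])
```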

Aside: What happens if we try to display the entire DataFrame `weather_all` in our notebook?

In [None]:
weather_all

For large DataFrames, only the first and last rows are displayed, with a `...` in between. How many rows are shown depends on your version of `pandas` and its display settings.
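
If you ever do want to see more rows, the display limit can be changed temporarily. The sketch below uses a made-up 100-row DataFrame and the `display.max_rows` option; the limit of 200 is an arbitrary choice for this example:

```python
import pandas as pd

df = pd.DataFrame({'n': range(100)})  # made-up 100-row DataFrame

# Text form with whatever row limit is currently in effect
short_text = repr(df)

# Temporarily raise the limit so that all 100 rows are included
with pd.option_context('display.max_rows', 200):
    full_text = repr(df)

print(len(short_text.splitlines()), 'lines when truncated')
print(len(full_text.splitlines()), 'lines in full')
```

Using `option_context` keeps the change local to the `with` block, so the rest of the notebook is unaffected.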

b) How many unique Station Names and Datetimes are in the data? List the unique values.

In [None]:
station_names_unique = weather_all['Station Name'].unique()
print(station_names_unique)
print(len(station_names_unique))

In [None]:
datetime_unique = weather_all['Datetime'].unique()
print(datetime_unique)
print(weather_all['Datetime'].nunique())

There are 20 unique station names and 52 unique datetimes.

c) What are the three most common Conditions and the three most common Wind Directions?

In [None]:
weather_all['Conditions'].value_counts()

In [None]:
weather_all['Wind Direction'].value_counts()

- Three most common Conditions: Mostly Cloudy, Partly Cloudy, Mainly Sunny
- Three most common Wind Directions: S, SSW, SSE
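
Since `value_counts` sorts from most to least common, the top three can also be picked out with `head(3)` instead of reading them off the full output. A small sketch on made-up wind directions:

```python
import pandas as pd

# Made-up wind directions, only for illustration
winds = pd.Series(['S', 'SSW', 'S', 'N', 'SSW', 'S', 'SSE', 'SSE', 'SSW', 'S'])

top_three = winds.value_counts().head(3)
print(top_three)
print(list(top_three.index))  # just the labels of the three most common values
```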

d) Which column has the most missing values? How many are missing in this column?

In [None]:
weather_all.isnull().sum()

The 'Humidex (C)' column has the most missing values (446).
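
Rather than scanning the counts by eye, one option is to let `pandas` pick out the largest one with `idxmax` and `max`. A minimal sketch on a made-up DataFrame (the column names mirror the workshop data, but the values are invented):

```python
import numpy as np
import pandas as pd

# Made-up DataFrame with gaps in two columns
df = pd.DataFrame({
    'Temperature (C)': [7.5, 8.0, np.nan],
    'Humidex (C)': [np.nan, np.nan, 25.0],
    'Conditions': ['Rain', 'Cloudy', 'Clear'],
})

missing_counts = df.isnull().sum()
print(missing_counts.sort_values(ascending=False))  # most missing first
print(missing_counts.idxmax())  # label of the column with the most missing values
print(missing_counts.max())     # how many values are missing in that column
```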

e) How many temperatures are less than 0 and how many are greater than 20?

In [None]:
temp = weather_all['Temperature (C)']

In [None]:
temp_lt_0 = temp < 0
temp_lt_0.head()

In [None]:
temp_lt_0.sum()

There are 25 temperatures less than 0 C.

In [None]:
temp_gt_20 = temp > 20
temp_gt_20.head()

In [None]:
temp_gt_20.sum()

There are 107 temperatures greater than 20 C.
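
The comparison and the count can also be combined into a single expression by wrapping the comparison in parentheses. A short sketch on made-up temperatures:

```python
import pandas as pd

temps = pd.Series([-3.0, 0.0, 12.5, 20.0, 24.1, -0.5])  # made-up values

# Each comparison gives a Series of Booleans; sum counts the True values
n_below_zero = (temps < 0).sum()
n_above_20 = (temps > 20).sum()

print(n_below_zero)
print(n_above_20)
```

Note that `<` and `>` are strict comparisons, so values of exactly 0 or exactly 20 are not counted.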

# Interlude: Data Visualization