# Data types and missing data workbook

## Introduction

This is the workbook component of the "Data types and missing data" section of the tutorial.

# Relevant Resources
- [Data Types and Missing Data Reference](https://www.kaggle.com/residentmario/data-types-and-missing-data-reference)

# Set Up
**Fork this notebook using the button towards the top of the screen.**

Run the following cell to load your data and some utility functions.

In [None]:
import pandas as pd
import seaborn as sns
from learntools.advanced_pandas.data_types_missing_data import *

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)

# Checking Answers

**Check your answers in each exercise using the `check_qN` function** (replacing `N` with the number of the exercise). For example, here's how you would check an incorrect answer to exercise 1:

In [None]:
check_q1(pd.DataFrame())

If you get stuck, **use the `answer_qN` function to see the code with the correct answer.**

For the first set of questions, running `check_qN` on the correct answer returns `True`.

For the second set of questions, using this function to check a correct answer will present an informative graph!

# Exercises

**Exercise 1**: What is the data type of the `points` column in the dataset?

In [None]:
reviews.points.dtype

**Exercise 2**: Create a `Series` from entries in the `price` column, but convert the entries to strings. Hint: strings are `str` in native Python.

In [None]:
reviews.price.astype(str)
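
As a minimal sketch of what the conversion does, the snippet below applies `astype(str)` to a small toy `Series`. The values and the name `toy_prices` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy prices, invented for illustration (not rows from the wine dataset).
toy_prices = pd.Series([15.0, 30.0, 65.0], name="price")

# astype(str) already returns a Series, so no extra pd.Series(...) wrapper is needed.
as_text = toy_prices.astype(str)

print(as_text.tolist())
print(type(as_text.iloc[0]))
```

Each entry is now a Python `str`, so numeric operations such as `as_text.mean()` no longer apply.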

Here are a few visual exercises on missing data.

**Exercise 3**: Wines are sometimes missing prices. How often does this occur? Generate a `Series` that, for each review in the dataset, states whether the wine reviewed has a null `price`.

In [None]:
reviews.price.isnull()
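
To turn the boolean mask into an answer to "how often", one option is to sum or average it, since `True` counts as 1. The sketch below uses a toy `Series` with made-up values; `toy_prices`, `n_missing`, and `share_missing` are hypothetical names.

```python
import pandas as pd

# Toy prices with two missing entries, invented for illustration.
toy_prices = pd.Series([15.0, None, 30.0, None], name="price")

# Boolean mask: True where the price is missing.
missing = toy_prices.isnull()

# True is treated as 1, so sum() counts missing values and mean() gives their share.
n_missing = int(missing.sum())
share_missing = float(missing.mean())

print(n_missing, share_missing)
```

The same two calls should work on `reviews.price.isnull()` for the full dataset.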

**Exercise 4**: What are the most common wine-producing regions? Create a `Series` counting the number of times each value occurs in the `region_1` field. This field is often missing data, so replace missing values with `Unknown`. Sort in descending order. Your output should look something like this:

```
Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64
```

In [None]:
reviews.region_1.fillna("Unknown").value_counts()
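
The order of the two calls matters: `value_counts` skips missing values by default, so they have to be replaced before counting. A small sketch on toy data (the region values and the name `toy_regions` are made up for illustration):

```python
import pandas as pd

# Toy regions with three missing entries, invented for illustration.
toy_regions = pd.Series(
    ["Napa Valley", None, "Napa Valley", "Etna", None, None],
    name="region_1",
)

# Replace missing values first, then count; value_counts sorts
# in descending order of frequency by default.
counts = toy_regions.fillna("Unknown").value_counts()

print(counts)
```

Calling `toy_regions.value_counts()` without the `fillna` step would leave the three missing entries out of the result.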

**Exercise 5**: Now for something more challenging. Although the dataset doesn't include a specific field for this information, many of the wines reviewed are from a series of wines specific to a given growing season and hence year (or "vintage"). For aficionados, significant differences exist between wines of different vintages. The `title` of the wine often mentions the vintage.

Create a `Series` showing, for each wine, what vintage (year) the wine comes from. Do this by extracting the year, if one occurs, from the `title` field of the dataset. Report wines missing vintages as `N/A`. Sort the values in ascending order (e.g. earlier years before later ones).

Hint: write a map that will extract the vintage of each wine in the dataset. The vintages reviewed range from 2000 to 2017, no earlier or later. Use `fillna` to impute the missing values.

In [None]:
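def find_year(title):
    # Look for the first "20" in the title and treat the four characters
    # starting there as a candidate year.
    if "20" in title:
        idx = title.index("20")
        maybe_date = title[idx:idx + 4]
        # Keep the candidate only if all four characters are digits.
        if maybe_date.isdigit():
            return maybe_date
    # No plausible year found; fillna below turns this into "N/A".
    return None

reviews.title.map(find_year).fillna("N/A").sort_values()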
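
The helper above only inspects the first `"20"` in each title, so a title in which another `"20"` precedes the vintage would be reported as `N/A`. As an alternative sketch (not the workbook's reference answer), a regular expression restricted to the 2000 to 2017 range from the hint can be passed to `Series.str.extract`. The titles below are toy values made up for illustration.

```python
import pandas as pd

# Toy titles, invented for illustration.
toy_titles = pd.Series([
    "Nicosia 2013 Vulka Bianco (Etna)",
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    "Cuvee 20 Barrels 2012 Red",   # a stray "20" appears before the year
    "NV Brut Sparkling",           # no vintage in the title
])

# One capture group matching 2000-2009 or 2010-2017.
# expand=False returns a Series instead of a one-column DataFrame.
years = toy_titles.str.extract(r"(20(?:0\d|1[0-7]))", expand=False)

vintages = years.fillna("N/A").sort_values()
print(vintages.tolist())
```

Because the years are kept as strings, the sort is lexicographic, which places `N/A` after all four-digit years.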

# Keep going

Move on to the [**Renaming and combining workbook**](https://www.kaggle.com/residentmario/renaming-and-combining-workbook)