## Introduction to pandas - Solutions

In [None]:
import pandas as pd

wine_reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)

#### Exercise 1

In the cell below, create a DataFrame `fruits` that looks like this:

![](https://i.imgur.com/Ax3pp2A.png)

In [None]:
fruits = pd.DataFrame([[30, 21]], columns=['Apples', 'Bananas'])
fruits

#### Exercise 2

Create a dataframe `fruit_sales` that matches the diagram below:

![](https://i.imgur.com/CHPn7ZF.png)

In [None]:
fruit_sales = pd.DataFrame({'Apples': [35,41],'Bananas':[21,34]}, index=['2017 Sales','2018 Sales'])
fruit_sales

#### Exercise 3

Create a variable `ingredients` with a Series that looks like:

```
Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object
```

In [None]:
quantities = ['4 cups', '1 cup', '2 large', '1 can']
items = ['Flour', 'Milk', 'Eggs', 'Spam']
ingredients = pd.Series(quantities, index=items, name='Dinner')

ingredients

#### Exercise 4

Select the `description` column from `wine_reviews` and assign the result to the variable `desc`.

What type of object is `desc`?

In [None]:
desc = wine_reviews.description
type(desc)

#### Exercise 5

Select the first value from the `description` column of `wine_reviews`, assigning it to the variable `first_description`.

In [None]:
first_description = wine_reviews.description[0]
first_description

#### Exercise 6

Select the first row of data (the first record) from `wine_reviews`, assigning it to the variable `first_row`.

In [None]:
first_row = wine_reviews.iloc[0,:]
first_row

#### Exercise 7

Select the first 10 values from the `description` column in `wine_reviews`, assigning the result to the variable `first_descriptions`.

Hint: format your output as a pandas Series.

In [None]:
first_descriptions = wine_reviews.loc[:9,'description']
first_descriptions
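
The solution above relies on `.loc` slicing by label, which includes the end label. The sketch below illustrates that behaviour on a small made-up frame (the `toy` data is hypothetical, not taken from the wine dataset) and compares it with two positional alternatives. It assumes the default `0..n-1` integer index; with any other index, label and positional slices can differ.

```python
import pandas as pd

# Hypothetical stand-in for the wine data, with the default 0..n-1 index.
toy = pd.DataFrame({"description": [f"wine {i}" for i in range(12)]})

by_label = toy.loc[:9, "description"]         # label slice: label 9 is included
by_position = toy["description"].iloc[:10]    # positional slice: position 10 is excluded
by_head = toy["description"].head(10)         # first 10 rows, whatever the index

print(len(by_label), by_label.equals(by_position), by_label.equals(by_head))
```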

#### Exercise 8

Select the records with index labels `1`, `2`, `3`, `5`, and `8`, assigning the result to the variable `sample_reviews`.

In other words, generate the following DataFrame:

![](https://i.imgur.com/sHZvI1O.png)

In [None]:
indices = [1, 2, 3, 5, 8]
sample_reviews = wine_reviews.loc[indices]

sample_reviews

#### Exercise 9

Create a variable `df` containing the `country`, `province`, `region_1`, and `region_2` columns of the records with the index labels `0`, `1`, `10`, and `100`. In other words, generate the following DataFrame:

![](https://i.imgur.com/FUCGiKP.png)

In [None]:
cols = ['country', 'province', 'region_1', 'region_2']
indices = [0, 1, 10, 100]
df = wine_reviews.loc[indices, cols]

df

#### Exercise 10

Create a variable `df` containing the `country` and `variety` columns of the first 100 records.

In [None]:
cols = ['country', 'variety']
df = wine_reviews.loc[:99, cols]

df

#### Exercise 11

Create a DataFrame `italian_wines` containing reviews of wines made in `Italy`.

In [None]:
italian_wines = wine_reviews[wine_reviews.country == 'Italy']

#### Exercise 12

Create a DataFrame `top_oceania_wines` containing all reviews with at least 95 points (out of 100) for wines from Australia or New Zealand.

In [None]:
top_oceania_wines = wine_reviews.loc[
    (wine_reviews.country.isin(['Australia', 'New Zealand']))
    & (wine_reviews.points >= 95)
]

top_oceania_wines
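
As a minimal sketch of how the combined mask works, the same pattern can be tried on a tiny made-up frame (the `toy` values are hypothetical). Each condition yields a boolean Series, and `&` combines them element-wise; the parentheses around the comparison are needed because `&` binds more tightly than `>=`.

```python
import pandas as pd

# Hypothetical data, only for illustrating the mask.
toy = pd.DataFrame({
    "country": ["Australia", "New Zealand", "Italy", "Australia"],
    "points": [96, 94, 97, 95],
})

from_oceania = toy["country"].isin(["Australia", "New Zealand"])
high_score = toy["points"] >= 95

# Element-wise AND of the two boolean Series, then row selection with .loc.
top = toy.loc[from_oceania & high_score]
print(top)
```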

#### Exercise 13

What is the median of the `points` column in the `wine_reviews` DataFrame?

In [None]:
median_points = wine_reviews.points.median()

#### Exercise 14

What countries are represented in the dataset? (Your answer should not include any duplicates.)

In [None]:
countries = wine_reviews.country.unique()

#### Exercise 15

How often does each country appear in the dataset? Create a Series `reviews_per_country` mapping countries to the count of reviews of wines from that country.

In [None]:
reviews_per_country = wine_reviews.country.value_counts()

#### Exercise 16

Create variable `centered_price` containing a version of the `price` column with the mean price subtracted.

In [None]:
centered_price = wine_reviews.price - wine_reviews.price.mean()

#### Exercise 17

I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio in the dataset.

In [None]:
bargain_idx = (wine_reviews.points / wine_reviews.price).idxmax()
bargain_wine = wine_reviews.loc[bargain_idx, 'title']
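
A small sketch of the `idxmax` step on made-up data (the `toy` frame and its titles are hypothetical). It shows two things worth keeping in mind: rows with a missing price produce a `NaN` ratio, which `idxmax` skips by default, and `idxmax` returns only the first label when several rows tie for the maximum.

```python
import pandas as pd

# Hypothetical data; the last wine has no price.
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [88, 90, 86],
    "price": [44.0, 15.0, None],
})

ratio = toy["points"] / toy["price"]   # 2.0, 6.0, NaN
best_idx = ratio.idxmax()              # NaN is skipped by default
best_title = toy.loc[best_idx, "title"]
print(best_title)
```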

#### Exercise 18

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column in the dataset.

In [None]:
n_trop = wine_reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = wine_reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
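
An alternative sketch, not the approach used above, is the vectorised string accessor `Series.str.contains`. The `descriptions` values below are made up for illustration. Like the `map` version, this counts how many descriptions contain each word at least once, not the total number of occurrences.

```python
import pandas as pd

# Hypothetical review texts.
descriptions = pd.Series([
    "tropical fruit aromas",
    "fruity and soft",
    "tropical, fruity finish",
    "dry and oaky",
    "a fruity red",
])

words = ["tropical", "fruity"]
# regex=False treats each word as a plain substring.
counts = pd.Series(
    {word: descriptions.str.contains(word, regex=False).sum() for word in words}
)
print(counts)
```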

#### Exercise 19

We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series `star_ratings` with the number of stars corresponding to each review in the dataset.

In [None]:
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = wine_reviews.apply(stars, axis='columns')
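
Row-wise `apply` can be slow on a frame of this size. As a hedged alternative (a swapped-in technique, not the solution above), the same rules can be expressed with `numpy.select`, which takes the first matching condition for each row. The `toy` data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data covering each branch of the rule.
toy = pd.DataFrame({
    "country": ["Canada", "US", "France", "Italy"],
    "points": [82, 96, 90, 84],
})

# Conditions are checked in order; the first one that is True wins.
conditions = [
    ((toy["country"] == "Canada") | (toy["points"] >= 95)).to_numpy(),
    (toy["points"] >= 85).to_numpy(),
]
stars_vec = pd.Series(np.select(conditions, [3, 2], default=1), index=toy.index)
print(stars_vec.tolist())
```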

#### Exercise 20

Who are the most common wine reviewers in the dataset? Create a `Series` whose index is the `taster_twitter_handle` category from the dataset, and whose values count how many reviews each person wrote.

In [None]:
reviews_written = wine_reviews.groupby('taster_twitter_handle').size()

#### Exercise 21

What is the best wine I can buy for a given amount of money? Create a `Series` whose index is wine prices and whose values are the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that `4.0` dollars is at the top and `3300.0` dollars is at the bottom).

In [None]:
best_rating_per_price = wine_reviews.groupby('price')['points'].max().sort_index()

#### Exercise 22

What are the minimum and maximum prices for each `variety` of wine? Create a `DataFrame` whose index is the `variety` category from the dataset and whose columns are the `min` and `max` prices for that variety.

In [None]:
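# Pass the aggregations by name; recent pandas versions warn when given the builtins min and max.
price_extremes = wine_reviews.groupby('variety').price.agg(['min', 'max'])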

#### Exercise 23

What are the most expensive wine varieties? Create a variable `sorted_varieties` containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).

In [None]:
sorted_varieties = price_extremes.sort_values(by=['min', 'max'], ascending=False)
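
To see how the two-key sort breaks ties, here is a minimal sketch on made-up prices (the `toy` frame and variety names are hypothetical). Varieties `A` and `B` share the same minimum, so the maximum decides their order.

```python
import pandas as pd

# Hypothetical prices: A and B tie on the minimum price.
toy = pd.DataFrame({
    "variety": ["A", "A", "B", "B", "C"],
    "price": [10.0, 20.0, 10.0, 50.0, 30.0],
})

extremes = toy.groupby("variety")["price"].agg(["min", "max"])
# Sort by min first, then by max, both descending.
ordered = extremes.sort_values(by=["min", "max"], ascending=False)
print(ordered)
```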

#### Exercise 24

Create a `Series` whose index is reviewers and whose values are the average review score given out by that reviewer. Hint: you will need the `taster_name` and `points` columns.

In [None]:
reviewer_mean_ratings = wine_reviews.groupby('taster_name').points.mean()

#### Exercise 25

Which combinations of countries and varieties are most common? Create a `Series` whose index is a `MultiIndex` of `{country, variety}` pairs. For example, a pinot noir produced in the US should map to `{"US", "Pinot Noir"}`. Sort the values in the `Series` in descending order based on wine count.

In [None]:
country_variety_counts = wine_reviews.groupby(['country', 'variety']).size().sort_values(ascending=False)
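
Grouping by a list of columns produces a `MultiIndex` whose entries are tuples. The sketch below uses a small hypothetical frame to show what the result looks like and how a single pair can be looked up.

```python
import pandas as pd

# Hypothetical data: three US Pinot Noirs, one US Merlot, two French Merlots.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "US", "France", "France"],
    "variety": ["Pinot Noir", "Pinot Noir", "Pinot Noir", "Merlot", "Merlot", "Merlot"],
})

pair_counts = toy.groupby(["country", "variety"]).size().sort_values(ascending=False)
print(pair_counts)

# Each index entry is a (country, variety) tuple.
n_french_merlot = pair_counts.loc[("France", "Merlot")]
```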

#### Exercise 26

What is the data type of the `points` column in the dataset?

In [None]:
dtype = wine_reviews.points.dtype
dtype

#### Exercise 27

Create a Series from entries in the `points` column, but convert the entries to strings. Hint: strings are `str` in native Python.

In [None]:
point_strings = wine_reviews.points.astype(str)

#### Exercise 28

Sometimes the price column is null. How many reviews in the dataset are missing a price?

In [None]:
missing_price_reviews = wine_reviews[wine_reviews.price.isnull()]
n_missing_prices = len(missing_price_reviews)

n_missing_prices
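
An equivalent way to count missing values, sketched here on a hypothetical `prices` Series, is to sum the boolean mask directly, since `True` counts as 1. This avoids building the intermediate filtered frame.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing entries.
prices = pd.Series([15.0, np.nan, 30.0, None, 12.5])

n_missing = prices.isnull().sum()             # True values are summed as 1
n_missing_via_len = len(prices[prices.isnull()])
print(n_missing, n_missing_via_len)
```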

#### Exercise 29

What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the `region_1` field. This field is often missing data, so replace missing values with `Unknown`. Sort in descending order. Your output should look something like this:

```
Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64
```

In [None]:
reviews_per_region = wine_reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)
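
A minimal sketch of the same chain on hypothetical region names. Note that `value_counts` drops missing values unless they are filled first (or `dropna=False` is passed), and that it already returns counts in descending order by default, so the explicit sort mainly documents the intent.

```python
import pandas as pd

# Hypothetical regions with three missing entries.
regions = pd.Series(["Napa Valley", None, "Napa Valley", None, None, "Etna"])

without_fill = regions.value_counts()                  # missing values are dropped
with_fill = regions.fillna("Unknown").value_counts()   # missing values counted as "Unknown"
print(with_fill)
```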

#### Exercise 30

In [None]:
wine_reviews.head()

`region_1` and `region_2` are pretty uninformative names for locale columns in the dataset. Create a copy of `wine_reviews` with these columns renamed to `region` and `locale`, respectively.

In [None]:
renamed = wine_reviews.rename(columns=dict(region_1='region', region_2='locale'))

#### Exercise 31

Set the index name in the dataset to `wines`.

In [None]:
reindexed = wine_reviews.rename_axis('wines', axis='rows')
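
Both `rename` and `rename_axis` return a new object by default and leave the original untouched. The sketch below checks this on a one-row hypothetical frame and chains the two calls from the last two exercises.

```python
import pandas as pd

# Hypothetical one-row frame with the two region columns.
toy = pd.DataFrame({"region_1": ["Etna"], "region_2": [None]})

relabelled = (
    toy.rename(columns={"region_1": "region", "region_2": "locale"})
       .rename_axis("wines", axis="rows")
)

print(list(relabelled.columns), relabelled.index.name)
print(list(toy.columns), toy.index.name)  # the original is unchanged
```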