# Relevant Resources
* **[Summary functions and maps](https://www.kaggle.com/residentmario/summary-functions-and-maps-reference)**
* [Official pandas cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)

# Set Up
Run the code cell below to load your data and the necessary utility functions.

In [None]:
import pandas as pd
pd.set_option('display.max_rows', 10)
import numpy as np
from learntools.advanced_pandas.summary_functions_maps import *

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

## Exercises

Look at your data by running the cell below:

In [None]:
reviews.head()

**Exercise 1**: What is the median of the `points` column?

In [None]:
reviews['points'].median()
# describe() is another useful function for initial exploration

**Exercise 2**: What countries are represented in the dataset?

In [None]:
reviews['country'].unique()

**Exercise 3**: What countries appear in the dataset most often?

In [None]:
reviews['country'].value_counts()

**Exercise 4**: Remap the `price` column by subtracting the median price. Use the `Series.map` method.

In [None]:
median_price = reviews['price'].median()
reviews['price'].map(lambda p: p - median_price)
# map is called on a Series: it takes every value in that column and converts
# it to a new value using the function you provide.
# For a simple subtraction like the one above, the following vectorized form
# yields the same result faster:
# reviews['price'] - median_price
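
The equivalence between `map` and plain arithmetic noted in the comments above can be checked on a small example. This is a minimal sketch on hypothetical toy prices (the `prices` Series is made up for illustration and is not part of the wine dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices; NaN stands in for a missing price.
prices = pd.Series([10.0, 25.0, np.nan, 40.0], name="price")
median_price = prices.median()  # NaN is skipped by default, so this is 25.0

# Element-wise version: the lambda is called once per value.
mapped = prices.map(lambda p: p - median_price)

# Vectorized version: one arithmetic operation over the whole Series.
vectorized = prices - median_price

# Series.equals treats NaNs in the same position as equal.
print(mapped.equals(vectorized))  # True
```

Both forms leave the missing price as `NaN`; the vectorized form simply avoids a Python-level function call per row.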

**Exercise 5**: I'm an economical wine buyer. Which wine is the "best bargain", i.e., which wine has the highest points-to-price ratio in the dataset?

Hint: use a map and the [`argmax` function](http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.Series.argmax.html).

In [None]:
# Unlike the proposed solution to the current Q5, I have substituted
# idxmax() for the deprecated argmax() function.
# idxmax() returns the index label of ONLY the first occurrence of the maximum,
# in this case 64590. Because it is a label, loc (label-based lookup) is the
# matching selector; iloc only works here because the index is 0..n-1.
reviews.loc[(reviews['points'] / reviews['price']).idxmax(), 'title']
# In reality there are TWO wines with a 21.5 ratio, so comparing every ratio
# against the maximum, instead of using idxmax(), is more correct:
reviews[(reviews.points / reviews.price) == (reviews.points / reviews.price).max()].title
# The same result can be obtained with numpy's nanmax, which returns the
# maximum of an array, or the maximum along an axis, ignoring any NaNs:
reviews.loc[(reviews.points / reviews.price) == np.nanmax(reviews.points / reviews.price)].title
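
To see how `idxmax()` and a boolean mask differ when several rows tie for the maximum, here is a small sketch on hypothetical toy data (the `wines` frame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: wines "B" and "C" tie for the best ratio,
# and wine "D" has no price.
wines = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "points": [90, 86, 86, 88],
        "price": [30.0, 4.0, 4.0, np.nan],
    }
)
ratio = wines["points"] / wines["price"]

# idxmax returns the index label of the first maximum only (NaN is skipped).
first_best = wines.loc[ratio.idxmax(), "title"]

# A boolean mask keeps every row that reaches the maximum.
all_best = wines.loc[ratio == ratio.max(), "title"].tolist()

print(first_best)  # B
print(all_best)    # ['B', 'C']
```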

Now it's time for some visual exercises. In the questions that follow, generate the data needed to produce the plots. These exercises use skills from this workbook as well as from previous ones. They look a lot like questions you will actually be asking when working with your own data!

<!--
**Exercise 6**: Sometimes the `province` and `region_1` provided in the dataset is the same value. Create a `Series` whose values counts how many times this occurs (`True`) and doesn't occur (`False`).
-->

**Exercise 6**: Is a wine more likely to be "tropical" or "fruity"? Create a `Series` counting how many times each of these two words appears in the `description` column in the dataset.

Hint: use a map to check each description for the string `tropical`, then count up the number of times this is `True`. Repeat this for `fruity`. Create a `Series` combining the two values at the end.

In [None]:
# This solution is obtained as follows.
tropical = reviews.description.map(lambda d: 'tropical' in d).value_counts()
fruity = reviews.description.map(lambda d: 'fruity' in d).value_counts()
# map flags each record whose description contains the string 'tropical'
# (or 'fruity') as True, and value_counts() returns the number of
# False (tropical=126364, fruity=120881) and True (tropical=3607, fruity=9090).
# With the value_counts() result assigned to a variable, tropical[True] is the
# number of descriptions in which the word 'tropical' was found (likewise for fruity).
# The following puts the numbers into a Series with proper index names:
pd.Series([tropical[True], fruity[True]], index=['tropical', 'fruity'])
# An alternative is to sum the boolean flags directly, since True counts as 1:
pd.Series([reviews.description.map(lambda p: 'tropical' in p).sum(),
           reviews.description.map(lambda p: 'fruity' in p).sum()],
          index=['tropical', 'fruity'])
# As an added reminder, the match is case-sensitive: there are 374 'Fruity' and
# 204 'Tropical' strings that are NOT counted above. Here is the full count:
pd.Series([reviews.description.map(lambda p: 'tropical' in p).sum(),
           reviews.description.map(lambda p: 'Tropical' in p).sum(),
           reviews.description.map(lambda p: 'fruity' in p).sum(),
           reviews.description.map(lambda p: 'Fruity' in p).sum()],
          index=['tropical', 'Tropical', 'fruity', 'Fruity'])
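
Since the capitalized variants are easy to miss, one option is to lowercase each description before matching. The sketch below assumes hypothetical toy descriptions (the `descriptions` Series is made up) and uses the pandas `.str` accessor instead of `map`. Like the solutions above, it counts descriptions that contain the word, not total occurrences:

```python
import pandas as pd

# Hypothetical toy descriptions standing in for reviews.description.
descriptions = pd.Series(
    [
        "Tropical fruit aromas",
        "A fruity, tropical blend",
        "Fruity and fresh",
        "Dry and mineral",
        "Ripe, fruity finish",
    ]
)

# Lowercase once, then reuse the result for each word.
lowered = descriptions.str.lower()

# regex=False asks for a plain substring match rather than a regular expression.
counts = pd.Series(
    {
        word: lowered.str.contains(word, regex=False).sum()
        for word in ["tropical", "fruity"]
    }
)
print(counts)
```

On this toy data `counts` holds 2 for `tropical` and 3 for `fruity`. Note that "fruit" in the first description does not match `fruity`.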


**Exercise 7**: What combination of countries and varieties are most common?

Create a `Series` whose index consists of strings of the form `"<Country> - <Wine Variety>"`. For example, a pinot noir produced in the US should map to `"US - Pinot Noir"`. The values should be counts of how many times the given wine appears in the dataset. Drop any reviews with incomplete `country` or `variety` data.

Hint: you can do this in three steps. First, generate a `DataFrame` whose `country` and `variety` columns are non-null. Then use a map to create a series whose entries are a `str` concatenation of those two columns. Finally, generate a `Series` counting how many times each label appears in the dataset.

In [None]:
# Use loc to select only the rows in which BOTH country and variety are not
# null. In reality variety is never null, but the check is kept to follow the
# exercise. Also select only the country and variety columns for the new ans DataFrame.
ans = reviews.loc[(reviews['country'].notnull()) & (reviews['variety'].notnull()), ['country', 'variety']]
# Now use apply to create a Series concatenating country, a dash, and variety.
# I am using %-formatting as the concatenating function rather than the +
# operator, since the former is arguably better for readability and performance;
# see https://softwareengineering.stackexchange.com/questions/304445/why-is-s-better-than-for-concatenation
ans = ans.apply(lambda srs: "%s - %s" % (srs.country, srs.variety), axis='columns')
# The official suggestion uses + concatenation instead:
# ans = ans.apply(lambda srs: srs.country + " - " + srs.variety, axis='columns')
ans.value_counts().head(10)
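
The three steps from the hint (drop incomplete rows, build the label, count) can also be written without a row-wise `apply`, because string columns support `+` directly. This is a sketch on hypothetical toy data (the `toy` frame is invented for illustration), not the exercise's official solution:

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
toy = pd.DataFrame(
    {
        "country": ["US", "US", None, "Italy", "US"],
        "variety": ["Pinot Noir", "Pinot Noir", "Merlot", "Red Blend", None],
    }
)

# Step 1: drop rows where either column is missing.
complete = toy.dropna(subset=["country", "variety"])

# Step 2: concatenate the two string columns element-wise.
labels = complete["country"] + " - " + complete["variety"]

# Step 3: count how many times each label appears.
counts = labels.value_counts()
print(counts)
```

Here `counts` contains `"US - Pinot Noir"` twice and `"Italy - Red Blend"` once; the two rows with a missing country or variety are excluded.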

# Keep Going
**[Continue to grouping and sorting](https://www.kaggle.com/kernels/fork/598715).**