<a href="https://colab.research.google.com/github/kumbhat10/Data_Science_in_Python/blob/master/Learning_Pandas/Summary_Functions_And_Maps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd

## **Mount Google drive and import data to use**


In [0]:
from google.colab import drive  # mount the google drive to access the wine reviews csv file
drive.mount('/content/gdrive')
reviews = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/Sample_data/winemag-data-130k-v2.csv')

In [48]:
reviews.head(5)  ## first 5 rows

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,new_points
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.447138
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,-1.447138
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,-1.447138
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,-1.447138
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,-1.447138


# **Summary functions**

In [14]:
reviews.points.describe()

count    129971.000000
mean         88.447138
std           3.039730
min          80.000000
25%          86.000000
50%          88.000000
75%          91.000000
max         100.000000
Name: points, dtype: float64

describe() generates a high-level summary of the attributes of the given column.

It is type-aware: its output changes with the data type of the input.

The output above only makes sense for numerical data; for string data we get the following:

In [15]:
reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
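
As a small sketch of this type-aware behaviour, the snippet below calls describe() on a numeric and a text column of a made-up table. The `toy` DataFrame and its values are assumptions for illustration only, not rows taken from the wine dataset.

```python
import pandas as pd

# Hypothetical miniature of the reviews table, for illustration only.
toy = pd.DataFrame({
    "points": [87, 90, 92, 87],
    "taster_name": ["Roger Voss", "Paul Gregutt", "Roger Voss", None],
})

numeric_summary = toy.points.describe()     # count, mean, std, min, quartiles, max
text_summary = toy.taster_name.describe()   # count, unique, top, freq

print(numeric_summary.index.tolist())
print(text_summary.index.tolist())
```

Note that `count` only includes non-missing entries, which is why the taster count in the real dataset (103727) is lower than the number of rows.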

In [16]:
reviews.points.mean()

88.44713820775404

## Unique function

In [17]:
reviews.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

## Value counts method value_counts()

In [18]:
reviews.taster_name.value_counts()

Roger Voss            25514
Michael Schachner     15134
Kerin O’Keefe         10776
Virginie Boone         9537
Paul Gregutt           9532
Matt Kettmann          6332
Joe Czerwinski         5147
Sean P. Sullivan       4966
Anna Lee C. Iijima     4415
Jim Gordon             4177
Anne Krebiehl MW       3685
Lauren Buzzeo          1835
Susan Kostrzewa        1085
Mike DeSimone           514
Jeff Jenssen            491
Alexander Peartree      415
Carrie Dykes            139
Fiona Adams              27
Christina Pickard         6
Name: taster_name, dtype: int64
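
The unique() array above has 20 entries because it includes `nan`, while describe() and value_counts() report 19 tasters. A minimal sketch of that difference, using a made-up `tasters` Series (an assumption for illustration):

```python
import pandas as pd

# Hypothetical Series with one missing taster name.
tasters = pd.Series(["Roger Voss", "Paul Gregutt", None, "Roger Voss"])

distinct = tasters.unique()                            # keeps the missing value as its own entry
counts = tasters.value_counts()                        # drops missing values by default
counts_with_nan = tasters.value_counts(dropna=False)   # opt in to counting them

print(len(distinct), len(counts), len(counts_with_nan))
```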

# **Maps**

A map takes one set of values and 'maps' them to another set of values.

There are two mapping methods.

## **map()**

Suppose we want to re-center the wine scores so that the new mean is 0, i.e. subtract the current mean from every score. We can do it as follows:


In [38]:
review_mean = reviews.points.mean()
reviews['new_points'] = reviews.points.map(lambda x: x-review_mean)   ## add new column with new points
print('Old mean was ',reviews.points.mean())
print('New mean is ',reviews.new_points.mean())

Old mean was  88.44713820775404
New mean is  -1.2830454158312965e-14


The function you pass to map() should expect a single value from the Series and return a transformed version of that value.

map() returns a new Series in which every value has been transformed by your function; the original Series is left unchanged.
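
A minimal sketch of this behaviour on a made-up three-element Series (the `points` values are assumptions for illustration). It also shows that, for simple arithmetic like this, subtracting the mean from the Series directly gives the same result:

```python
import pandas as pd

points = pd.Series([87, 90, 93])   # hypothetical scores
mean = points.mean()

centered_map = points.map(lambda p: p - mean)   # one Python function call per element
centered_vec = points - mean                    # same result, computed on the whole Series at once

print(centered_map.tolist())
print(points.tolist())   # the original Series is untouched
```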

## **apply()**

apply() is the equivalent method if we want to **transform a whole DataFrame** by calling a custom function on each row.

In [0]:
review_mean = reviews.points.mean()

def remean_points(row):
  row.points = row.points - review_mean
  return row

new_reviews = reviews.apply(remean_points,axis='columns')

apply() does not change the original DataFrame; it returns a new DataFrame containing the transformed rows.



In [46]:
reviews.head(1)   ## points is still 87 in the original DataFrame

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,new_points
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.447138


In [45]:
new_reviews.head(1) ## points have changed 

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,new_points
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,-1.447138,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,-1.447138
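
The same comparison can be sketched on a tiny made-up table. The `toy` DataFrame is an assumption for illustration, and this variant copies each row before changing it, since the pandas documentation advises against functions that mutate the object passed to them:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["Italy", "US"], "points": [87, 91]})   # hypothetical data
mean = toy.points.mean()

def remean_points(row):
    row = row.copy()                        # work on a copy instead of mutating the row pandas passes in
    row["points"] = row["points"] - mean
    return row

new_toy = toy.apply(remean_points, axis="columns")

print(toy.points.tolist())       # original values
print(new_toy.points.tolist())   # re-centered values in the new DataFrame
```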


If we had called reviews.apply() with **axis='index'**, then instead of passing a function to transform each row, we would need to pass a function to transform each column.
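
A short sketch of the column-wise case, assuming a made-up all-numeric `scores` table so that the same function makes sense for every column:

```python
import pandas as pd

scores = pd.DataFrame({"points": [87, 90, 93], "price": [15.0, 14.0, 13.0]})   # hypothetical data

def center(column):
    # Receives one whole column (a Series) at a time.
    return column - column.mean()

centered = scores.apply(center, axis="index")
print(centered)
```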

Pandas also understands ordinary operators applied between Series of equal length: the operation is carried out element by element.

For example, an easy way of combining country and region information in the dataset is the following. Rows where either value is missing come out as NaN.

In [49]:
reviews.country + '--' + reviews.region_1

0                     Italy--Etna
1                             NaN
2           US--Willamette Valley
3         US--Lake Michigan Shore
4           US--Willamette Valley
                   ...           
129966                        NaN
129967                 US--Oregon
129968             France--Alsace
129969             France--Alsace
129970             France--Alsace
Length: 129971, dtype: object
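
The NaN rows above appear because a missing value on either side makes the whole result missing. One possible way to keep those rows is to fill the gaps first; the sketch below uses made-up Series, and the "Unknown" placeholder is an arbitrary choice for illustration:

```python
import pandas as pd

country = pd.Series(["Italy", "Portugal", "US"])           # hypothetical data
region = pd.Series(["Etna", None, "Willamette Valley"])    # second region is missing

combined = country + "--" + region                     # missing region gives a missing result
filled = country + "--" + region.fillna("Unknown")     # placeholder keeps the row usable

print(combined.tolist())
print(filled.tolist())
```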

# **Practice Exercise**

## 0.
I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable `bargain_wine` with the title of the wine with the highest points-to-price ratio in the dataset.

In [63]:
points_to_price_ratio = reviews.points/reviews.price
bargain_wine = reviews.title.loc[points_to_price_ratio.idxmax()]  # idxmax() returns an index label, so look it up with .loc
bargain_wine

'Bandit NV Merlot (California)'
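
idxmax() returns an index label rather than a position. With the default 0..n-1 index the two coincide, so either kind of lookup finds the same title here, but they differ as soon as the index is something else. A small sketch with a made-up `wines` table (all names and values are assumptions for illustration):

```python
import pandas as pd

wines = pd.DataFrame(
    {"title": ["A", "B", "C"], "points": [90, 86, 88], "price": [30.0, 4.0, None]},
    index=[10, 20, 30],   # deliberately not 0, 1, 2
)

ratio = wines.points / wines.price          # NaN where the price is missing
best_label = ratio.idxmax()                 # label of the largest ratio; NaN is skipped by default
bargain = wines.loc[best_label, "title"]    # label-based lookup

print(best_label, bargain)
```

Here `wines.title.iloc[best_label]` would raise an IndexError, because there is no position 20 in a three-row table.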

## 1.
There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series `descriptor_counts` counting how many times each of these two words appears in the `description` column in the dataset.

In [59]:
descriptions = reviews.description
count_fruity = descriptions.map(lambda text: 'fruity' in text).sum()      # number of reviews mentioning "fruity"
count_tropical = descriptions.map(lambda text: 'tropical' in text).sum()  # number of reviews mentioning "tropical"

descriptor_counts = pd.Series([count_tropical, count_fruity],index = ['tropical','fruity'])
print(descriptor_counts)

tropical    3607
fruity      9090
dtype: int64
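
Note that this counts the number of reviews containing each word, not the total number of occurrences, and that the match is case-sensitive. An alternative sketch using the string accessor `str.contains` instead of map(), on made-up descriptions (assumptions for illustration):

```python
import pandas as pd

descriptions = pd.Series([
    "Aromas include tropical fruit and broom",
    "This is ripe and fruity",
    "Tropical notes, fruity finish",   # capital "T" does not match "tropical"
])

counts = pd.Series({
    word: descriptions.str.contains(word, regex=False).sum()
    for word in ["tropical", "fruity"]
})

print(counts)
```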


## 2.
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series `star_ratings` with the number of stars corresponding to each review in the dataset.

In [61]:
def rating(row):
    if row.country == 'Canada' or row.points >= 95:  # `or` avoids the precedence trap: `|` binds tighter than `>=`
        star  = 3
    elif row.points>=85:
        star = 2
    else:
        star = 1
    return star
        
star_ratings = reviews.apply(rating, axis='columns')  # apply() already returns a Series here
star_ratings

0         2
1         2
2         2
3         2
4         2
         ..
129966    2
129967    2
129968    2
129969    2
129970    2
Length: 129971, dtype: int64
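
As an alternative to a row-by-row apply(), the same rules can be sketched with whole-column comparisons and `numpy.select`, which picks the first matching condition for each row. The `toy` table is made up for illustration. When comparisons are combined with `|`, each one needs its own parentheses, because `|` binds more tightly than `>=` in Python:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Canada", "US", "Italy", "France"],   # hypothetical data
    "points": [82, 96, 90, 80],
})

conditions = [
    (toy.country == "Canada") | (toy.points >= 95),   # 3 stars
    toy.points >= 85,                                 # 2 stars (checked only if the first rule fails)
]
stars = pd.Series(np.select(conditions, [3, 2], default=1), index=toy.index)

print(stars.tolist())
```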