# Pandas
- Solve short hands-on challenges to perfect your data manipulation skills.
- https://www.kaggle.com/learn/pandas

## 3.- Summary Functions and Maps
- Extract insights from your data.

In [1]:
import numpy as np
import pandas as pd

print('np.__version__:', np.__version__)
print('pd.__version__:', pd.__version__)

#pd.set_option('display.max_rows', 5)

np.__version__: 1.23.5
pd.__version__: 1.5.3


In [2]:
reviews = pd.read_csv('Red.csv')
reviews

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.00,2011
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.50,2017
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015
3,Bardolino 2019,Italy,Bardolino,Cavalchina,3.5,100,8.72,2019
4,Ried Scheibner Pinot Noir 2016,Austria,Carnuntum,Markowitsch,3.9,100,29.15,2016
...,...,...,...,...,...,...,...,...
8661,6th Sense Syrah 2016,United States,Lodi,Michael David Winery,3.8,994,16.47,2016
8662,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,4.0,995,20.09,2016
8663,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,996,23.95,2010
8664,Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,3.5,998,6.21,2019


### Summary functions (not an official name)
- e.g. .describe()

In [3]:
# describe for all columns (df.describe())
reviews.describe()
# type(reviews.describe())    # pandas.core.frame.DataFrame

Unnamed: 0,Rating,NumberOfRatings,Price
count,8666.0,8666.0,8666.0
mean,3.890342,415.287445,39.145065
std,0.308548,899.726373,84.936307
min,2.5,25.0,3.55
25%,3.7,66.0,10.68
50%,3.9,157.0,18.2
75%,4.1,401.0,38.1425
max,4.8,20293.0,3410.79


In [4]:
# describe for only one column (Series.describe())
reviews.Price.describe()
# type(reviews.Price.describe())  # pandas.core.series.Series

count    8666.000000
mean       39.145065
std        84.936307
min         3.550000
25%        10.680000
50%        18.200000
75%        38.142500
max      3410.790000
Name: Price, dtype: float64

In [5]:
# for string data here's what we get:
reviews.Country.describe()

count      8666
unique       30
top       Italy
freq       2650
Name: Country, dtype: object

In [6]:
# particular simple statistics, e.g.:
print(f"{'Rating.mean():':<20} {reviews.Rating.mean():>8.2f}")
print(f"{'Price.max():':<20} {reviews.Price.max():>8.2f}")
print(f"{'Country.count():':<20} {reviews.Country.count():>8.2f}")

Rating.mean():           3.89
Price.max():          3410.79
Country.count():      8666.00


In [7]:
# To see a list of unique values we can use the unique() function:
print(len(reviews.Country.unique()))
reviews.Country.unique()
# type(reviews.Country.unique())      # numpy.ndarray

30


array(['France', 'Italy', 'Austria', 'New Zealand', 'Chile', 'Australia',
       'South Africa', 'Spain', 'United States', 'Portugal', 'Hungary',
       'Brazil', 'Argentina', 'Romania', 'Germany', 'Greece', 'Mexico',
       'Moldova', 'Switzerland', 'Slovenia', 'Israel', 'Georgia',
       'Lebanon', 'Uruguay', 'Turkey', 'Croatia', 'China', 'Slovakia',
       'Bulgaria', 'Canada'], dtype=object)

In [8]:
# To see a list of unique values and how often each occurs in the dataset,
# we can use the value_counts() method:
reviews.Country.value_counts()

Italy            2650
France           2256
Spain            1142
South Africa      500
United States     374
Chile             326
Germany           248
Australia         246
Argentina         246
Portugal          230
Austria           220
New Zealand        63
Brazil             40
Romania            23
Lebanon            15
Israel             13
Greece             13
Switzerland        12
Hungary             9
Moldova             8
Slovenia            8
Turkey              6
Georgia             5
Uruguay             4
Croatia             2
Bulgaria            2
Canada              2
Mexico              1
China               1
Slovakia            1
Name: Country, dtype: int64
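
The relationship between unique(), value_counts() and count() can be checked on a tiny example. This is only a sketch: the `countries` Series below is a made-up six-value stand-in, not the real Country column.

```python
import pandas as pd

# Hypothetical stand-in for reviews.Country (six values chosen for illustration).
countries = pd.Series(
    ["France", "France", "Italy", "Italy", "Austria", "Italy"], name="Country"
)

counts = countries.value_counts()  # sorted by frequency, most common first
print(counts)

# The index of value_counts() holds the unique values,
# and the counts add up to the number of non-missing entries.
print(len(countries.unique()) == len(counts))
print(counts.sum() == countries.count())
```

Both comparisons print True here because the sample has no missing values; value_counts() and count() both skip NaN by default.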

### Maps
There are two mapping methods that you will use often: map() and apply().

In [9]:
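# Remean the Rating to 0, i.e. subtract the mean from every value.
# The mean is computed once up front so the lambda does not recompute it per value.
rating_mean = reviews.Rating.mean()
reviews.Rating.map(lambda r: r - rating_mean)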

0       0.309658
1       0.409658
2       0.009658
3      -0.390342
4       0.009658
          ...   
8661   -0.090342
8662    0.109658
8663   -0.190342
8664   -0.390342
8665   -0.490342
Name: Rating, Length: 8666, dtype: float64

In [10]:
# apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row.
rating_mean = reviews.Rating.mean()  # computed once, not once per row

def remean_rating(row):
    row.Rating = row.Rating - rating_mean
    return row

reviews.apply(remean_rating, axis='columns')

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,0.309658,100,95.00,2011
1,Lirac 2017,France,Lirac,Château Mont-Redon,0.409658,100,15.50,2017
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,0.009658,100,7.45,2015
3,Bardolino 2019,Italy,Bardolino,Cavalchina,-0.390342,100,8.72,2019
4,Ried Scheibner Pinot Noir 2016,Austria,Carnuntum,Markowitsch,0.009658,100,29.15,2016
...,...,...,...,...,...,...,...,...
8661,6th Sense Syrah 2016,United States,Lodi,Michael David Winery,-0.090342,994,16.47,2016
8662,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,0.109658,995,20.09,2016
8663,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,-0.190342,996,23.95,2010
8664,Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,-0.390342,998,6.21,2019



If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to give a function to transform each column.
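
A minimal sketch of the column-wise case, assuming a small numeric stand-in for reviews (values copied from the first rows shown earlier). The names `sample` and `center_column` are made up for this illustration.

```python
import pandas as pd

# Hypothetical four-row stand-in for the numeric columns of reviews.
sample = pd.DataFrame({
    "Rating": [4.2, 4.3, 3.9, 3.5],
    "Price": [95.00, 15.50, 7.45, 8.72],
})

def center_column(col):
    # With axis='index', `col` is one whole column (a Series), not a row.
    return col - col.mean()

centered = sample.apply(center_column, axis="index")
print(centered)
```

Each column is passed to the function in turn, so every column of the result has a mean of (approximately) zero.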

Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first rows of reviews, we can see that they still have their original Rating values.

In [11]:
reviews.head(3)

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.0,2011
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.5,2017
2,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015
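
If the transformed values should be kept, one option is to assign the result to a new column. A minimal sketch, assuming a three-row stand-in for reviews; the column name RatingCentered is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for reviews with only the Rating column.
sample = pd.DataFrame({"Rating": [4.2, 4.3, 3.9]})

rating_mean = sample.Rating.mean()
centered = sample.Rating.map(lambda r: r - rating_mean)

# map() returned a new Series; the original column is untouched.
print(sample.Rating.tolist())  # [4.2, 4.3, 3.9]

# Assigning the result is what actually stores it in the DataFrame.
sample["RatingCentered"] = centered
print(sample.columns.tolist())  # ['Rating', 'RatingCentered']
```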


In [42]:
## Pandas provides many common mapping operations as built-ins. For example,
# here's a faster way of remeaning our Rating column:
reviews.Rating - reviews.Rating.mean()

0       0.309658
1       0.409658
2       0.009658
          ...   
8663   -0.190342
8664   -0.390342
8665   -0.490342
Name: Rating, Length: 8666, dtype: float64

In this code we are performing an operation between many values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the Series.

In [43]:
## Pandas will also understand what to do if we perform these operations between
# Series of equal length. For example, an easy way of combining country and
# region information in the dataset would be to do the following:
reviews.Country + ' - ' + reviews.Region

0                          France - Pomerol
1                            France - Lirac
2                           Italy - Toscana
                       ...                 
8663                    France - Haut-Médoc
8664    Australia - South Eastern Australia
8665                    Argentina - Tunuyán
Length: 8666, dtype: object

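These operators are faster than map() or apply() because they use speedups built into pandas: the operation runs over the whole Series at once instead of calling a Python function for each value. All of the standard Python operators (>, <, ==, and so on) work in this manner.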
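
As a small illustration of the comparison operators, the sketch below uses a five-row stand-in for reviews (values copied from the first rows shown earlier). The 4.0 threshold is arbitrary.

```python
import pandas as pd

# Hypothetical five-row stand-in for reviews.
sample = pd.DataFrame({
    "Country": ["France", "France", "Italy", "Italy", "Austria"],
    "Rating": [4.2, 4.3, 3.9, 3.5, 3.9],
})

# Each comparison is applied element-wise and returns a boolean Series.
is_italian = sample.Country == "Italy"
highly_rated = sample.Rating >= 4.0

print(is_italian.tolist())    # [False, False, True, True, False]
print(highly_rated.tolist())  # [True, True, False, False, False]
```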

However, they are not as flexible as map() or apply(), which can do more advanced things, like applying conditional logic, which cannot be done with addition and subtraction alone.
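
For instance, conditional logic fits naturally into a function passed to map(). A minimal sketch, assuming a five-value stand-in for reviews.Price; the `price_tier` function and its thresholds are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for reviews.Price (values from the first rows shown earlier).
prices = pd.Series([95.00, 15.50, 7.45, 8.72, 29.15], name="Price")

def price_tier(price):
    # Arbitrary thresholds, chosen only to demonstrate if/else inside map().
    if price < 10:
        return "budget"
    elif price < 50:
        return "mid"
    return "premium"

tiers = prices.map(price_tier)
print(tiers.tolist())  # ['premium', 'mid', 'budget', 'budget', 'mid']
```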