# Pandas
- Solve short hands-on challenges to perfect your data manipulation skills.
- https://www.kaggle.com/learn/pandas

## 5.- Renaming and Combining
- Data comes in from many sources. Help it all make sense together.
- Column names, index names, and other naming conventions often need to change.
- Data from multiple DataFrames and/or Series often needs to be combined.

In [14]:
import numpy as np
import pandas as pd

print('np.__version__:', np.__version__)
print('pd.__version__:', pd.__version__)

#pd.set_option('display.max_rows', 5)

np.__version__: 1.23.5
pd.__version__: 1.5.3


In [15]:
reviews = pd.read_csv('Red.csv')
reviews.head(2)

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.0,2011
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.5,2017


In [16]:
# Add a twitter_region column by prefixing each Region with '@'
reviews['twitter_region'] = '@' +  reviews.Region
reviews.head(2)

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.0,2011,@Pomerol
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.5,2017,@Lirac
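
The cell above builds the new column by element-wise string concatenation. As a minimal sketch of how that behaves when a value is missing, the toy frame below (hypothetical data, not taken from Red.csv) shows that a missing Region gives a missing twitter_region rather than an error. The `'unknown'` placeholder and the `twitter_filled` column name are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the last Region is deliberately missing.
toy = pd.DataFrame({'Region': ['Pomerol', 'Lirac', np.nan]})

# Element-wise string concatenation; a missing Region stays missing.
toy['twitter_region'] = '@' + toy.Region

# One possible alternative: fill the gaps before concatenating.
toy['twitter_filled'] = '@' + toy.Region.fillna('unknown')
print(toy)
```

Whether to keep the gap or fill it depends on how the column is used later; keeping NaN makes the missing data easier to spot.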


In [17]:
display(reviews.head(2))
print(len(reviews))

ndf_row1 = pd.DataFrame({'Name': 'Name1', 'Country': np.nan, 'Region': 'Reg1',
                         'Winery': 'Win1', 'Rating': 1, 'NumberOfRatings': 101,
                         'Price': 10.1, 'Year': 2021, 'twitter_region': '@Reg1'}, index=[0])
revs = pd.concat([reviews, ndf_row1]).reset_index(drop=True)    # append the new row at the end
ndf_row2 = pd.DataFrame({'Name': 'Name2', 'Country': np.nan, 'Region': 'Reg2',
                         'Winery': 'Win2', 'Rating': 2, 'NumberOfRatings': 202,
                         'Price': 20., 'Year': 2022, 'twitter_region': '@Reg2'}, index=[0])
revs = pd.concat([ndf_row2, revs]).reset_index(drop=True)    # prepend a row to the same DataFrame

lst_row3 = ['Name3', np.nan, 'Reg3', 'Win3', np.nan, 303, np.nan, 2023, '@Reg3']
revs.loc[3423.5] = lst_row3                         # a fractional label appends the row and turns the index into floats
print('revs.index.dtype now:', revs.index.dtype)
revs = revs.sort_index().reset_index(drop=True)     # sorting moves the row between labels 3423 and 3424; reset restores an integer index
revs

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
0,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.0,2011,@Pomerol
1,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.5,2017,@Lirac


8666
revs.index.dtype now: float64


Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
0,Name2,,Reg2,Win2,2.0,202,20.00,2022,@Reg2
1,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.00,2011,@Pomerol
2,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.50,2017,@Lirac
3,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015,@Toscana
4,Bardolino 2019,Italy,Bardolino,Cavalchina,3.5,100,8.72,2019,@Bardolino
...,...,...,...,...,...,...,...,...,...
8664,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,4.0,995,20.09,2016,@Maremma Toscana
8665,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,996,23.95,2010,@Haut-Médoc
8666,Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,3.5,998,6.21,2019,@South Eastern Australia
8667,Portillo Cabernet Sauvignon 2016,Argentina,Tunuyán,Salentein,3.4,999,7.88,2016,@Tunuyán
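
The cell above inserts rows at the start, at the end, and in the middle of a DataFrame. As a hedged sketch on a tiny toy frame (the names `toy`, `new_row`, and `pos` are hypothetical), the two approaches to a mid-table insert can be compared side by side: slicing plus `pd.concat`, and the fractional-label trick used above.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({'Name': ['a', 'b', 'c'], 'Price': [1.0, 2.0, 3.0]})
new_row = pd.DataFrame({'Name': ['x'], 'Price': [9.0]})
pos = 1  # the new row should end up at this position

# Option 1: split the frame at pos and concatenate the three pieces.
by_concat = pd.concat([toy.iloc[:pos], new_row, toy.iloc[pos:]],
                      ignore_index=True)

# Option 2: assign to a fractional label, then sort and renumber the index.
by_label = toy.copy()
by_label.loc[pos - 0.5] = ['x', 9.0]
by_label = by_label.sort_index().reset_index(drop=True)

print(by_concat)
print(by_label)
```

Both variants place the new row at position 1. The `pd.concat` form avoids the temporary float index, which may be preferable when the index dtype matters.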


### Renaming
- The first function we'll introduce here is rename(), which lets you change index names and/or column names.

In [18]:
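# Change the Rating column name to score.
# rename() returns a new DataFrame; revs itself stays unchanged unless the
# result is assigned back or inplace=True is passed.
# revs.rename(columns={'Rating': 'score'}, inplace=True) # permanent change
revs.rename(columns={'Rating': 'score'})
revs.head(2)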

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
0,Name2,,Reg2,Win2,2.0,202,20.0,2022,@Reg2
1,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.0,2011,@Pomerol
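
The output above still shows a Rating column because rename() returns a new DataFrame instead of modifying the original. A minimal sketch on toy data (the `toy` and `renamed` names are illustrative) makes the difference explicit:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({'Name': ['Pomerol 2011', 'Lirac 2017'],
                    'Rating': [4.2, 4.3]})

# rename() leaves toy untouched and returns a renamed copy.
renamed = toy.rename(columns={'Rating': 'score'})

print(list(toy.columns))      # original column names
print(list(renamed.columns))  # 'Rating' replaced by 'score'
```

Assigning the result back (`toy = toy.rename(...)`) is a common way to make the change stick.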


In [19]:
# rename() accepts both columns= and index= keyword arguments
revs.rename(index={0: 'zero', 1: 'first'})

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
zero,Name2,,Reg2,Win2,2.0,202,20.00,2022,@Reg2
first,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.00,2011,@Pomerol
2,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.50,2017,@Lirac
3,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015,@Toscana
4,Bardolino 2019,Italy,Bardolino,Cavalchina,3.5,100,8.72,2019,@Bardolino
...,...,...,...,...,...,...,...,...,...
8664,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,4.0,995,20.09,2016,@Maremma Toscana
8665,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,996,23.95,2010,@Haut-Médoc
8666,Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,3.5,998,6.21,2019,@South Eastern Australia
8667,Portillo Cabernet Sauvignon 2016,Argentina,Tunuyán,Salentein,3.4,999,7.88,2016,@Tunuyán


In [20]:
revs.head(3)

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
0,Name2,,Reg2,Win2,2.0,202,20.0,2022,@Reg2
1,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.0,2011,@Pomerol
2,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.5,2017,@Lirac


#### set_index
You'll probably rename columns very often, but rename index values very rarely. To replace the index wholesale, set_index() is usually more convenient.
> Both the row index and the column index can have their own name attribute. The complementary rename_axis() method may be used to change these names.

In [21]:
revs.rename_axis('wines', axis='rows').rename_axis('fields', axis='columns')

fields,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
wines,,,,,,,,,
0,Name2,,Reg2,Win2,2.0,202,20.00,2022,@Reg2
1,Pomerol 2011,France,Pomerol,Château La Providence,4.2,100,95.00,2011,@Pomerol
2,Lirac 2017,France,Lirac,Château Mont-Redon,4.3,100,15.50,2017,@Lirac
3,Erta e China Rosso di Toscana 2015,Italy,Toscana,Renzo Masi,3.9,100,7.45,2015,@Toscana
4,Bardolino 2019,Italy,Bardolino,Cavalchina,3.5,100,8.72,2019,@Bardolino
...,...,...,...,...,...,...,...,...,...
8664,Botrosecco Maremma Toscana 2016,Italy,Maremma Toscana,Le Mortelle,4.0,995,20.09,2016,@Maremma Toscana
8665,Haut-Médoc 2010,France,Haut-Médoc,Château Cambon La Pelouse,3.7,996,23.95,2010,@Haut-Médoc
8666,Shiraz 2019,Australia,South Eastern Australia,Yellow Tail,3.5,998,6.21,2019,@South Eastern Australia
8667,Portillo Cabernet Sauvignon 2016,Argentina,Tunuyán,Salentein,3.4,999,7.88,2016,@Tunuyán
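
The text above mentions set_index() without showing it. As a minimal sketch, assuming a toy frame with two of the wines from the table (the variable names are hypothetical), set_index() promotes a column to the row index and rename_axis() then labels both axes:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({'Name': ['Pomerol 2011', 'Lirac 2017'],
                    'Rating': [4.2, 4.3]})

# Promote the Name column to the row index; returns a new DataFrame.
by_name = toy.set_index('Name')

# Rows can now be looked up by wine name.
lirac_rating = by_name.loc['Lirac 2017', 'Rating']

# rename_axis() changes the names of the axes, not the labels on them.
labelled = (by_name.rename_axis('wines', axis='rows')
                   .rename_axis('fields', axis='columns'))
print(labelled)
```

reset_index() reverses set_index() by moving the index back into an ordinary column.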


### Combining
- When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways.
- The main tools are pd.concat(), DataFrame.join(), and DataFrame.merge().

In [23]:
# Make two new DataFrames, one per country of origin.
rev_nz = revs.loc[revs.Country == 'New Zealand']
display(rev_nz.head(2))
rev_bz = revs.loc[revs.Country == 'Brazil']
display(rev_bz.head(2))

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
7,Marion's Vineyard Pinot Noir 2016,New Zealand,Wairarapa,Schubert,4.0,100,43.87,2016,@Wairarapa
79,Pinot Noir 2016,New Zealand,Marlborough,Clos Henri Vineyard,3.9,102,27.78,2016,@Marlborough


Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
24,Virtus Tannat 2013,Brazil,Serra Gaúcha,Monte Paschoal,2.9,100,6.77,2013,@Serra Gaúcha
67,Fausto Tannat 2015,Brazil,Serra Gaúcha,Pizzato,3.6,1012,11.35,2015,@Serra Gaúcha


In [26]:
pd.concat([rev_nz, rev_bz]).reset_index(drop=True)

Unnamed: 0,Name,Country,Region,Winery,Rating,NumberOfRatings,Price,Year,twitter_region
0,Marion's Vineyard Pinot Noir 2016,New Zealand,Wairarapa,Schubert,4.0,100,43.87,2016,@Wairarapa
1,Pinot Noir 2016,New Zealand,Marlborough,Clos Henri Vineyard,3.9,102,27.78,2016,@Marlborough
2,Gimblett Gravels Le Sol 2015,New Zealand,Gimblett Gravels,Craggy Range,4.3,114,85.27,2015,@Gimblett Gravels
3,Pinot Noir 2016,New Zealand,Marlborough,Cloudy Bay,3.9,1163,39.95,2016,@Marlborough
4,Pinot Noir 2017,New Zealand,Marlborough,Brancott Estate,3.3,1180,10.21,2017,@Marlborough
...,...,...,...,...,...,...,...,...,...
98,Reserva Merlot 2016,Brazil,Serra Gaúcha,Don Guerino,3.7,85,12.62,2016,@Serra Gaúcha
99,Intenso Tannat 2013,Brazil,Campanha,Salton,3.3,90,9.95,2013,@Campanha
100,Merlot (Grande Vindima) 2008,Brazil,Encruzilhada do Sul,Lidio Carraro,3.9,902,45.77,2008,@Encruzilhada do Sul
101,Elos Touriga Nacional - Tannat 2012,Brazil,Encruzilhada do Sul,Lidio Carraro,3.9,941,29.19,2012,@Encruzilhada do Sul
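
The cell above stacks two frames and renumbers the rows. As a sketch on toy stand-ins for rev_nz and rev_bz (one row each, for illustration only), the snippet below contrasts the default behaviour of pd.concat with its `ignore_index` and `keys` options:

```python
import pandas as pd

# Toy stand-ins for rev_nz and rev_bz; they keep their original row labels.
nz = pd.DataFrame({'Name': ['Pinot Noir 2016'], 'Country': ['New Zealand']},
                  index=[79])
bz = pd.DataFrame({'Name': ['Virtus Tannat 2013'], 'Country': ['Brazil']},
                  index=[24])

# Default: rows are stacked and the original labels are preserved.
stacked = pd.concat([nz, bz])

# ignore_index=True renumbers the rows, like reset_index(drop=True).
renumbered = pd.concat([nz, bz], ignore_index=True)

# keys= adds an outer index level that records where each row came from.
keyed = pd.concat([nz, bz], keys=['NZ', 'BZ'])
print(keyed)
```

The `keys` form is handy when the source of each row needs to stay recoverable, for example with `keyed.loc['BZ']`.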


### .join()
- join() lets you combine different DataFrame objects that have an index in common.


In [27]:
# Pair wines from the two countries that share the same Rating
l = rev_nz.set_index(['Rating'])
r = rev_bz.set_index(['Rating'])

l.join(r, lsuffix='_NZ', rsuffix='_BZ')

Rating,Name_NZ,Country_NZ,Region_NZ,Winery_NZ,NumberOfRatings_NZ,Price_NZ,Year_NZ,twitter_region_NZ,Name_BZ,Country_BZ,Region_BZ,Winery_BZ,NumberOfRatings_BZ,Price_BZ,Year_BZ,twitter_region_BZ
3.2,Pinot Noir Marlborough 2017,New Zealand,Marlborough,Konrad,36,19.98,2017,@Marlborough,Merlot Reserva 2008,Brazil,Vale dos Vinhedos,Pizzato,103.0,15.90,2008,@Vale dos Vinhedos
3.2,Pinot Noir Marlborough 2017,New Zealand,Marlborough,Konrad,36,19.98,2017,@Marlborough,Da'Divas Pinot Noir 2010,Brazil,Encruzilhada do Sul,Lidio Carraro,147.0,14.50,2010,@Encruzilhada do Sul
3.2,Pinot Noir Marlborough 2017,New Zealand,Marlborough,Konrad,36,19.98,2017,@Marlborough,Family Vineyards Pinot Noir 2015,Brazil,Rio Grande do Sul,Miolo,186.0,9.90,2015,@Rio Grande do Sul
3.2,Pinot Noir Marlborough 2017,New Zealand,Marlborough,Konrad,36,19.98,2017,@Marlborough,Agnus Cabernet Sauvignon 2012,Brazil,Encruzilhada do Sul,Lidio Carraro,406.0,11.65,2012,@Encruzilhada do Sul
3.2,Pinot Noir Marlborough 2017,New Zealand,Marlborough,Konrad,36,19.98,2017,@Marlborough,Agnus Merlot 2011,Brazil,Encruzilhada do Sul,Lidio Carraro,722.0,11.65,2011,@Encruzilhada do Sul
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4.2,Ngakirikiri Cabernet Sauvignon 2013,New Zealand,Marlborough,Villa Maria,96,135.00,2013,@Marlborough,Lote 43 Cabernet Sauvignon - Merlot 2012,Brazil,Vale dos Vinhedos,Miolo,4846.0,22.90,2012,@Vale dos Vinhedos
4.2,Gimblett Gravels Sophia 2016,New Zealand,Gimblett Gravels,Craggy Range,98,85.27,2016,@Gimblett Gravels,Lote 43 Cabernet Sauvignon - Merlot 2012,Brazil,Vale dos Vinhedos,Miolo,4846.0,22.90,2012,@Vale dos Vinhedos
4.3,Gimblett Gravels Le Sol 2015,New Zealand,Gimblett Gravels,Craggy Range,114,85.27,2015,@Gimblett Gravels,,,,,,,,
4.3,Te Muna Aroha 2014,New Zealand,Martinborough,Craggy Range,66,85.27,2014,@Martinborough,,,,,,,,


> The lsuffix and rsuffix parameters are necessary here because the New Zealand and Brazilian datasets have the same column names. If this weren't true (because, say, we'd renamed them beforehand) we wouldn't need them.
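
The note on suffixes can be checked on a small scale. The sketch below uses toy frames indexed by Rating (illustrative data; `left`, `right`, and the other names are hypothetical) to show the error raised without suffixes, the suffixed join, the rename-first alternative, and the effect of `how='inner'`:

```python
import pandas as pd

# Toy stand-ins for rev_nz and rev_bz, indexed by Rating.
left = pd.DataFrame({'Rating': [3.2, 4.3],
                     'Name': ['nz_a', 'nz_b']}).set_index('Rating')
right = pd.DataFrame({'Rating': [3.2, 3.2],
                      'Name': ['bz_a', 'bz_b']}).set_index('Rating')

# Overlapping column names without suffixes raise a ValueError.
try:
    left.join(right)
    overlap_error = False
except ValueError:
    overlap_error = True

# Suffixes disambiguate the shared Name column; join() defaults to how='left'.
joined = left.join(right, lsuffix='_NZ', rsuffix='_BZ')

# Renaming beforehand removes the overlap, so no suffixes are needed.
renamed = left.rename(columns={'Name': 'Name_NZ'}).join(
    right.rename(columns={'Name': 'Name_BZ'}))

# how='inner' keeps only ratings present in both frames.
inner = left.join(right, lsuffix='_NZ', rsuffix='_BZ', how='inner')
print(joined)
```

With the default left join, the 4.3 rating has no Brazilian match and its right-hand column is empty, which mirrors the last rows of the output above.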

# More .join() + .merge() and .pivot() examples.
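
As one possible starting point for the merge() part, the sketch below repeats the Rating pairing with pd.merge on an ordinary column instead of the index. The frames are toy data and the variable names are hypothetical; pivot() is not covered here.

```python
import pandas as pd

# Toy stand-ins for rev_nz and rev_bz with Rating as a regular column.
nz = pd.DataFrame({'Rating': [3.2, 4.3], 'Name': ['nz_a', 'nz_b']})
bz = pd.DataFrame({'Rating': [3.2, 3.2], 'Name': ['bz_a', 'bz_b']})

# merge() matches rows on the given column; suffixes play the same
# role as lsuffix/rsuffix in join().
merged = pd.merge(nz, bz, on='Rating', how='left', suffixes=('_NZ', '_BZ'))

# indicator=True adds a _merge column that reports where each row matched.
flagged = pd.merge(nz, bz, on='Rating', how='left',
                   suffixes=('_NZ', '_BZ'), indicator=True)
print(flagged)
```

Unlike join(), merge() keeps Rating as a column in the result, so no set_index() step is needed first.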