# Introduction to Data Science and Machine Learning

<p align="center">
    <img width="699" alt="image" src="https://user-images.githubusercontent.com/49638680/159042792-8510fbd1-c4ac-4a48-8320-bc6c1a49cdae.png">
</p>

---

## Exploratory Data Analysis - Homework

The aim of this notebook is to give you practice performing an exploratory data analysis that extracts useful information hidden in the data.

We are going to analyse the [Tennis dataset](http://tennis-data.co.uk). To guide your analysis, approach the problem by asking yourself questions; the role of the analysis is to find the answers.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from urllib.request import urlopen  
import os.path as osp
import os
import logging
import zipfile
from glob import glob
logging.getLogger().setLevel('INFO')

## Helpers

In [2]:
def download_file(url_str, path):
    # Download the resource at url_str and write its raw bytes to path.
    with urlopen(url_str) as response, open(path, 'wb') as output:
        output.write(response.read())

def extract_file(archive_path, target_dir):
    # Unpack every member of the zip archive into target_dir.
    with zipfile.ZipFile(archive_path, 'r') as zip_file:
        zip_file.extractall(target_dir)

## Download the dataset

In [3]:
BASE_URL = 'http://tennis-data.co.uk'
DATA_DIR = "tennis_data"
ATP_DIR = './{}/ATP'.format(DATA_DIR)
WTA_DIR = './{}/WTA'.format(DATA_DIR)

ATP_URLS = [BASE_URL + "/%i/%i.zip" % (i,i) for i in range(2000,2019)]
WTA_URLS = [BASE_URL + "/%iw/%i.zip" % (i,i) for i in range(2007,2019)]

os.makedirs(osp.join(ATP_DIR, 'archives'), exist_ok=True)
os.makedirs(osp.join(WTA_DIR, 'archives'), exist_ok=True)

for files, directory in ((ATP_URLS, ATP_DIR), (WTA_URLS, WTA_DIR)):
    for dl_path in files:
        logging.info("downloading & extracting file %s", dl_path)
        archive_path = osp.join(directory, 'archives', osp.basename(dl_path))
        download_file(dl_path, archive_path)
        extract_file(archive_path, directory)
    
ATP_FILES = sorted(glob("%s/*.xls*" % ATP_DIR))
WTA_FILES = sorted(glob("%s/*.xls*" % WTA_DIR))

df_atp = pd.concat([pd.read_excel(f) for f in ATP_FILES], ignore_index=True)
df_wta = pd.concat([pd.read_excel(f) for f in WTA_FILES], ignore_index=True)

logging.info("%i matches ATP in df_atp", df_atp.shape[0])
logging.info("%i matches WTA in df_wta", df_wta.shape[0])

INFO:root:downloading & extracting file http://tennis-data.co.uk/2000/2000.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2001/2001.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2002/2002.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2003/2003.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2004/2004.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2005/2005.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2006/2006.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2007/2007.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2008/2008.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2009/2009.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2010/2010.zip
INFO:root:downloading & extracting file http://tennis-data.co.uk/2011/2011.zip
INFO:root:downloading & extracting file http://tenni

## Problem description

### The data
The website [http://tennis-data.co.uk/alldata.php](http://tennis-data.co.uk/alldata.php) gathers outcomes of both WTA (Women's Tennis Association) and ATP (Association of Tennis Professionals, men only) tennis matches over several years.
A short description of each variable can be found here: [http://www.tennis-data.co.uk/notes.txt](http://www.tennis-data.co.uk/notes.txt)

### What is expected from you
First of all, answer the following questions.

#### Questions
Please answer the following questions about the dataset with the appropriate line(s) of code.

##### Example

__Question__: How many ATP matches are there in the dataset?

__Answer__:
```python
len(df_atp)
```

1. Who are the three ATP players with the most wins?
2. How many sets did the player "Federer R." win in total?
3. How many sets did the player "Federer R." win during the years 2016 and 2017?
4. For each match, what percentage of the winner's past matches were victories?
5. How are players' wins distributed across the age segments `[16-23]`, `[24-30]`, and `[30+]`?
6. Does the behaviour found in the previous answer change between men and women?

_Hint_: Careful with null values and how you handle them.
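
The hint matters most for the set-counting questions. Below is a minimal sketch of one way to count a player's sets while keeping track of nulls. It uses a small made-up frame (the rows are hypothetical, not taken from the dataset) and assumes that `Wsets` and `Lsets` hold the sets won by the match winner and the match loser respectively. The helper names `sets_won` and `missing_set_rows` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the relevant columns of df_atp.
toy = pd.DataFrame({
    "Winner": ["Federer R.", "Nadal R.", "Federer R.", "Djokovic N."],
    "Loser": ["Nadal R.", "Federer R.", "Djokovic N.", "Federer R."],
    "Wsets": [2.0, 3.0, np.nan, 2.0],
    "Lsets": [1.0, 1.0, np.nan, 0.0],
})

def sets_won(df, player):
    """Sets won by `player` across matches won and lost; NaN is skipped."""
    as_winner = df.loc[df["Winner"] == player, "Wsets"].sum()  # sum() skips NaN
    as_loser = df.loc[df["Loser"] == player, "Lsets"].sum()
    return as_winner + as_loser

def missing_set_rows(df, player):
    """Number of the player's matches whose set counts are missing."""
    involved = (df["Winner"] == player) | (df["Loser"] == player)
    has_gap = df[["Wsets", "Lsets"]].isna().any(axis=1)
    return int((involved & has_gap).sum())

federer_sets = sets_won(toy, "Federer R.")
federer_missing = missing_set_rows(toy, "Federer R.")
print(federer_sets, federer_missing)
```

Reporting the number of skipped matches next to the total makes it explicit how much of the answer depends on the null-handling choice.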

#### Bonus points

* your notebook contains graphics that are both interesting and pretty
* we can go through your entire notebook without frowning
* you teach us something cool 🙂

#### Free Analysis

We would like you to perform some free analysis. For example study distributions, correlations, etc.

---

## Your Work

Have fun!

In [4]:
df_atp.describe()

| | ATP | Best of | W1 | L1 | W4 | L4 | W5 | L5 | Wsets | CBW | ... | UBW | UBL | LBW | LBL | SJW | SJL | MaxW | MaxL | AvgW | AvgL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 52298.0 | 52298.0 | 52035.0 | 52037.0 | 4731.0 | 4731.0 | 1791.0 | 1791.0 | 52074.0 | 17506.0 | ... | 10671.0 | 10671.0 | 28131.0 | 28142.0 | 15572.0 | 15579.0 | 22745.0 | 22745.0 | 22745.0 | 22745.0 |
| mean | 33.222532 | 3.372366 | 5.794331 | 4.056229 | 5.777003 | 3.863454 | 6.637633 | 3.756002 | 2.14176 | 1.81208 | ... | 1.815867 | 3.542479 | 1.810226 | 3.451461 | 1.796538 | 3.557943 | 1.99861 | 8.326076 | 1.834821 | 3.594448 |
| std | 18.115493 | 0.778516 | 1.239577 | 1.845206 | 1.274712 | 1.895683 | 2.290596 | 2.817183 | 0.460311 | 0.868254 | ... | 0.996238 | 3.646316 | 1.031691 | 3.075889 | 1.004273 | 3.27251 | 1.628982 | 397.235666 | 1.107884 | 3.28261 |
| min | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.01 | 1.02 | 1.0 | 1.0 | 1.0 | 1.01 | 1.01 | 1.01 | 1.01 | 1.01 |
| 25% | 19.0 | 3.0 | 6.0 | 3.0 | 6.0 | 2.0 | 6.0 | 2.0 | 2.0 | 1.28 | ... | 1.24 | 1.75 | 1.25 | 1.73 | 1.22 | 1.73 | 1.29 | 1.85 | 1.24 | 1.74 |
| 50% | 33.0 | 3.0 | 6.0 | 4.0 | 6.0 | 4.0 | 6.0 | 3.0 | 2.0 | 1.55 | ... | 1.5 | 2.5 | 1.5 | 2.5 | 1.5 | 2.63 | 1.57 | 2.78 | 1.5 | 2.55 |
| 75% | 49.0 | 3.0 | 6.0 | 6.0 | 6.0 | 6.0 | 7.0 | 5.0 | 2.0 | 2.05 | ... | 2.03 | 3.85 | 2.0 | 4.0 | 2.0 | 4.0 | 2.2 | 4.54 | 2.06 | 3.99 |
| max | 69.0 | 5.0 | 7.0 | 7.0 | 7.0 | 7.0 | 70.0 | 68.0 | 3.0 | 14.0 | ... | 18.0 | 60.0 | 26.0 | 51.0 | 19.0 | 81.0 | 76.0 | 42586.0 | 23.45 | 36.44 |


In [5]:
df_atp.info(memory_usage=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52298 entries, 0 to 52297
Data columns (total 54 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   ATP         52298 non-null  int64         
 1   Location    52298 non-null  object        
 2   Tournament  52298 non-null  object        
 3   Date        52298 non-null  datetime64[ns]
 4   Series      52298 non-null  object        
 5   Court       52298 non-null  object        
 6   Surface     52298 non-null  object        
 7   Round       52298 non-null  object        
 8   Best of     52298 non-null  int64         
 9   Winner      52298 non-null  object        
 10  Loser       52298 non-null  object        
 11  WRank       52283 non-null  object        
 12  LRank       52220 non-null  object        
 13  W1          52035 non-null  float64       
 14  L1          52037 non-null  float64       
 15  W2          51526 non-null  object        
 16  L2          51527 non-

In [4]:
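# The raw data contains the malformed value '`1' in Lsets; map it to 1.0 and cast everything else to float.
df_atp['Lsets'] = df_atp['Lsets'].apply(lambda x: 1.0 if x == '`1' else float(x))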

In [5]:
# Remove stray leading/trailing whitespace so each player has a single spelling.
df_atp['Winner'] = df_atp['Winner'].str.strip()

In [6]:
df_atp

| | ATP | Location | Tournament | Date | Series | Court | Surface | Round | Best of | Winner | ... | UBW | UBL | LBW | LBL | SJW | SJL | MaxW | MaxL | AvgW | AvgL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Dosedel S. | ... | | | | | | | | | | |
| 1 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Enqvist T. | ... | | | | | | | | | | |
| 2 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Escude N. | ... | | | | | | | | | | |
| 3 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Federer R. | ... | | | | | | | | | | |
| 4 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Fromberg R. | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 52293 | 67 | London | Masters Cup | 2018-11-16 | Masters Cup | Indoor | Hard | Round Robin | 3 | Zverev A. | ... | | | | | | | 1.44 | 3.40 | 1.38 | 3.14 |
| 52294 | 67 | London | Masters Cup | 2018-11-16 | Masters Cup | Indoor | Hard | Round Robin | 3 | Djokovic N. | ... | | | | | | | 1.22 | 6.03 | 1.17 | 5.14 |
| 52295 | 67 | London | Masters Cup | 2018-11-17 | Masters Cup | Indoor | Hard | Semifinals | 3 | Zverev A. | ... | | | | | | | 3.40 | 1.45 | 3.14 | 1.38 |
| 52296 | 67 | London | Masters Cup | 2018-11-17 | Masters Cup | Indoor | Hard | Semifinals | 3 | Djokovic N. | ... | | | | | | | 1.15 | 7.72 | 1.12 | 6.52 |


1. Sample 20% of the data as a test set.
2. Build a "naive" baseline model in which the higher-ranked player wins.
3. Compute the accuracy.
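
The three steps above can be sketched as follows. This is only an illustration on hypothetical toy ranks, not the notebook's actual solution: it assumes `WRank` and `LRank` are already numeric, that a lower rank number means a better-ranked player, and that matches with a missing rank are simply excluded. The helper name `baseline_accuracy` is made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matches; in the notebook these columns would come from
# df_atp after the ranks have been converted to floats.
toy = pd.DataFrame({
    "WRank": [5.0, 40.0, 3.0, 120.0, 8.0, np.nan, 15.0, 60.0, 2.0, 30.0],
    "LRank": [20.0, 10.0, 50.0, 80.0, 9.0, 70.0, np.nan, 90.0, 1.0, 45.0],
})

# 1. Hold out 20% of the matches as a test set (fixed seed for repeatability).
test = toy.sample(frac=0.2, random_state=0)
train = toy.drop(test.index)

def baseline_accuracy(df):
    """Share of matches won by the better-ranked player (lower rank number)."""
    # Skip matches where either player has no rank.
    ranked = df.dropna(subset=["WRank", "LRank"])
    if ranked.empty:
        return float("nan")
    # 2. The baseline predicts the better-ranked player, so it is correct
    #    exactly when the actual winner has the smaller rank number.
    correct = ranked["WRank"] < ranked["LRank"]
    # 3. Accuracy is the fraction of correct predictions.
    return correct.mean()

print(len(train), len(test))
print(baseline_accuracy(toy))
```

Because this baseline has nothing to fit, the training part goes unused here; the split only becomes relevant once a learned model is evaluated on the same held-out rows, for example with `baseline_accuracy(test)` as the score to beat.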

In [19]:
# 'NR' marks an unranked player; store it as NaN so the column becomes numeric.
df_atp['LRank'] = df_atp.LRank.apply(lambda x: np.nan if x == 'NR' else float(x))

In [21]:
df_atp['WRank'] = df_atp.WRank.apply(lambda x: np.nan if x == 'NR' else float(x))
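
As a possible alternative to the `apply` calls above, `pd.to_numeric` with `errors='coerce'` performs the same kind of conversion in one vectorised step. The sketch below uses a hypothetical raw column. The trade-off is that every unparseable value becomes NaN, not only `'NR'`, so unexpected entries would be hidden rather than raising an error.

```python
import pandas as pd

# Hypothetical raw rank column mixing numbers, numeric strings and 'NR'.
raw_rank = pd.Series([81, "198", "NR", 12.0], dtype=object)

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
rank = pd.to_numeric(raw_rank, errors="coerce")

print(rank.dtype)
print(rank.isna().sum())
```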

In [22]:
# Negative values mean the winner was the better-ranked player (lower rank number).
df_atp['rank_diff'] = df_atp.WRank - df_atp.LRank

In [23]:
df_atp

| | ATP | Location | Tournament | Date | Series | Court | Surface | Round | Best of | Winner | ... | UBL | LBW | LBL | SJW | SJL | MaxW | MaxL | AvgW | AvgL | rank_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Dosedel S. | ... | | | | | | | | | | -14.0 |
| 1 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Enqvist T. | ... | | | | | | | | | | -51.0 |
| 2 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Escude N. | ... | | | | | | | | | | -615.0 |
| 3 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Federer R. | ... | | | | | | | | | | -22.0 |
| 4 | 1 | Adelaide | Australian Hardcourt Championships | 2000-01-03 | International | Outdoor | Hard | 1st Round | 3 | Fromberg R. | ... | | | | | | | | | | -117.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 52293 | 67 | London | Masters Cup | 2018-11-16 | Masters Cup | Indoor | Hard | Round Robin | 3 | Zverev A. | ... | | | | | | 1.44 | 3.40 | 1.38 | 3.14 | -5.0 |
| 52294 | 67 | London | Masters Cup | 2018-11-16 | Masters Cup | Indoor | Hard | Round Robin | 3 | Djokovic N. | ... | | | | | | 1.22 | 6.03 | 1.17 | 5.14 | -6.0 |
| 52295 | 67 | London | Masters Cup | 2018-11-17 | Masters Cup | Indoor | Hard | Semifinals | 3 | Zverev A. | ... | | | | | | 3.40 | 1.45 | 3.14 | 1.38 | 2.0 |
| 52296 | 67 | London | Masters Cup | 2018-11-17 | Masters Cup | Indoor | Hard | Semifinals | 3 | Djokovic N. | ... | | | | | | 1.15 | 7.72 | 1.12 | 6.52 | -5.0 |


In [27]:
df_atp.loc[4]

ATP                                            1
Location                                Adelaide
Tournament    Australian Hardcourt Championships
Date                         2000-01-03 00:00:00
Series                             International
Court                                    Outdoor
Surface                                     Hard
Round                                  1st Round
Best of                                        3
Winner                               Fromberg R.
Loser                              Woodbridge T.
WRank                                       81.0
LRank                                      198.0
W1                                           7.0
L1                                           6.0
W2                                           5.0
L2                                           7.0
W3                                           6.0
L3                                           4.0
W4                                           NaN
L4                  