# Exercise: Data input, output, and iteration

In this first part, we will learn how to:

* Download files and rename them with Linux commands (in a notebook this is often quicker than resorting to Python libraries). We will get used to this practice over the course.
* Read and write files in `csv` format with the `pandas` library.
* Iterate over the rows of a `pd.DataFrame` object.

To download the file we will use the `wget` command followed by the URL in question. To rename it, we will use the form `mv <source> <destination>`:

In [1]:
# Download a CSV file
!wget https://www.stats.govt.nz/assets/Uploads/Gross-domestic-product/Gross-domestic-product-December-2021-quarter/Download-data/gross-domestic-product-December-2021-quarter-csv.csv
# Rename it so it is easier to access
!mv /content/gross-domestic-product-December-2021-quarter-csv.csv /content/dataset.csv

--2022-06-03 22:17:17--  https://www.stats.govt.nz/assets/Uploads/Gross-domestic-product/Gross-domestic-product-December-2021-quarter/Download-data/gross-domestic-product-December-2021-quarter-csv.csv
Resolving www.stats.govt.nz (www.stats.govt.nz)... 45.60.13.104
Connecting to www.stats.govt.nz (www.stats.govt.nz)|45.60.13.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18193842 (17M) [text/csv]
Saving to: ‘gross-domestic-product-December-2021-quarter-csv.csv’


2022-06-03 22:17:22 (5.81 MB/s) - ‘gross-domestic-product-December-2021-quarter-csv.csv’ saved [18193842/18193842]



Next, we read it:

In [2]:
# Libraries
import os
import pandas as pd

# Read the file
df = pd.read_csv(os.path.join(os.getcwd(),'dataset.csv'))
# OPTIONAL: Save the pd.DataFrame (index=False avoids writing the row index as an extra column)
# df.to_csv(os.path.join(os.getcwd(),'dataset.csv'), index=False)

It is convenient to use the `os` library: `os.getcwd()` returns the current working directory, and `os.path.join()` concatenates path components with the correct separator, which makes it easy to locate files.
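As a minimal sketch of the read/write round trip, the example below writes a small `DataFrame` to a temporary directory and reads it back. The toy data, the column names, and the variable names `df_demo`, `csv_path`, and `df_back` are assumptions made for illustration and are not part of the exercise dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df_demo = pd.DataFrame({"quarter": ["2021Q3", "2021Q4"], "value": [1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join inserts the correct separator for the current OS
    csv_path = os.path.join(tmp_dir, "dataset.csv")
    # index=False avoids writing the row index as an extra column
    df_demo.to_csv(csv_path, index=False)
    # Read the file back while the temporary directory still exists
    df_back = pd.read_csv(csv_path)

print(df_back)
```

Writing with `index=False` is a design choice here: without it, reading the file back would typically produce an extra `Unnamed: 0` column.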

Let us now see how to iterate. In `pandas` there are essentially two methods for this:

* `iterrows`: Link to the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html).
* `itertuples`: Link to the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.itertuples.html).

Your goal is to study the documentation and implement a loop that iterates over the rows of the `DataFrame`, leaving a simple `pass` in the loop body, and to use the `%%timeit` cell magic to evaluate which method is faster. An example is shown below:

```python
%%timeit
for elem in [1,2,3,4,5]:
    pass
```
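To make the difference between the two methods concrete, here is a small sketch of what each one yields per row. The `DataFrame` `df_demo` and its columns `a` and `b` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy DataFrame used only for illustration
df_demo = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# iterrows yields (index, Series) pairs, so a Series is built for every row
first_index, first_series = next(df_demo.iterrows())

# itertuples yields namedtuples; by default the index is the first field
first_tuple = next(df_demo.itertuples())

# index=False drops the index field from each namedtuple
first_tuple_no_index = next(df_demo.itertuples(index=False))

print(type(first_series).__name__)
print(first_tuple)
print(first_tuple_no_index)
```

Building a `Series` for every row is a plausible reason why `iterrows` tends to be slower, although the exact timings depend on the data and the environment.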

* `iterrows` method:

In [3]:
%%timeit
for row in df.iterrows():
    pass


1 loop, best of 5: 3.25 s per loop


1.19 s ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

*Expected type of answer:*

```python
1 loop, best of 5: 3.28 s per loop
```

* `itertuples` method with default arguments:

In [4]:
%%timeit
for row in df.itertuples():
    pass

10 loops, best of 5: 122 ms per loop


*Expected type of answer:*

```python
1 loop, best of 5: 117 ms per loop
```

* `itertuples` method without the index:

In [5]:
%%timeit

for row in df.itertuples(index=False):
    pass


10 loops, best of 5: 117 ms per loop


*Expected type of answer:*

```python
10 loops, best of 5: 118 ms per loop
```
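The `%%timeit` magic only works inside IPython or Jupyter. As a rough sketch, a similar comparison can be made in a plain Python script with the standard `timeit` module. The synthetic `DataFrame`, its size, and the helper functions `loop_iterrows` and `loop_itertuples` are assumptions chosen for illustration, and the absolute numbers will differ from those shown above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic DataFrame; its size is an arbitrary choice
df_demo = pd.DataFrame(np.arange(6000).reshape(2000, 3), columns=["a", "b", "c"])


def loop_iterrows():
    # Iterate over every row with iterrows and do nothing
    for _ in df_demo.iterrows():
        pass


def loop_itertuples():
    # Iterate over every row with itertuples (index omitted) and do nothing
    for _ in df_demo.itertuples(index=False):
        pass


# timeit.timeit returns the total time in seconds for `number` executions
time_iterrows = timeit.timeit(loop_iterrows, number=3)
time_itertuples = timeit.timeit(loop_itertuples, number=3)

print(f"iterrows:   {time_iterrows:.4f} s")
print(f"itertuples: {time_itertuples:.4f} s")
```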

# Ecommerce exercise

This exercise provides fictitious data on purchases made through Amazon, so the data may contain some inconsistencies. Work through the proposed exercises one at a time. There is more than one way to solve each of them, and every one can be answered in a single line.

To begin, we download the data:

In [7]:
!wget https://raw.githubusercontent.com/MatinMasimli/DataScience_ML_Udemy_Exercises/master/Ecommerce_purchases/Ecommerce%20Purchases.csv

--2022-06-03 22:19:09--  https://raw.githubusercontent.com/MatinMasimli/DataScience_ML_Udemy_Exercises/master/Ecommerce_purchases/Ecommerce%20Purchases.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2745852 (2.6M) [text/plain]
Saving to: ‘Ecommerce Purchases.csv.1’


2022-06-03 22:19:09 (189 MB/s) - ‘Ecommerce Purchases.csv.1’ saved [2745852/2745852]



We import the pandas library as `pd` and numpy as `np`:

In [8]:
import numpy as np
import pandas as pd

Now we read the DataFrame and show its first five rows:

In [9]:
data = pd.read_csv(os.path.join(os.getcwd(),'Ecommerce Purchases.csv'))

In [10]:
data[0:5]

Unnamed: 0,Address,Lot,AM or PM,Browser Info,Company,Credit Card,CC Exp Date,CC Security Code,CC Provider,Email,Job,IP Address,Language,Purchase Price
0,"16629 Pace Camp Apt. 448\nAlexisborough, NE 77...",46 in,PM,Opera/9.56.(X11; Linux x86_64; sl-SI) Presto/2...,Martinez-Herman,6011929061123406,02/20,900,JCB 16 digit,pdunlap@yahoo.com,"Scientist, product/process development",149.146.147.205,el,98.14
1,"9374 Jasmine Spurs Suite 508\nSouth John, TN 8...",28 rn,PM,Opera/8.93.(Windows 98; Win 9x 4.90; en-US) Pr...,"Fletcher, Richards and Whitaker",3337758169645356,11/18,561,Mastercard,anthony41@reed.com,Drilling engineer,15.160.41.51,fr,70.73
2,Unit 0065 Box 5052\nDPO AP 27450,94 vE,PM,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,"Simpson, Williams and Pham",675957666125,08/19,699,JCB 16 digit,amymiller@morales-harrison.com,Customer service manager,132.207.160.22,de,0.95
3,"7780 Julia Fords\nNew Stacy, WA 45798",36 vm,PM,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0 ...,"Williams, Marshall and Buchanan",6011578504430710,02/24,384,Discover,brent16@olson-robinson.info,Drilling engineer,30.250.74.19,es,78.04
4,"23012 Munoz Drive Suite 337\nNew Cynthia, TX 5...",20 IE,AM,Opera/9.58.(X11; Linux x86_64; it-IT) Presto/2...,"Brown, Watson and Andrews",6011456623207998,10/25,678,Diners Club / Carte Blanche,christopherwright@gmail.com,Fine artist,24.140.33.94,es,77.82


In [11]:
data.head(5)

Unnamed: 0,Address,Lot,AM or PM,Browser Info,Company,Credit Card,CC Exp Date,CC Security Code,CC Provider,Email,Job,IP Address,Language,Purchase Price
0,"16629 Pace Camp Apt. 448\nAlexisborough, NE 77...",46 in,PM,Opera/9.56.(X11; Linux x86_64; sl-SI) Presto/2...,Martinez-Herman,6011929061123406,02/20,900,JCB 16 digit,pdunlap@yahoo.com,"Scientist, product/process development",149.146.147.205,el,98.14
1,"9374 Jasmine Spurs Suite 508\nSouth John, TN 8...",28 rn,PM,Opera/8.93.(Windows 98; Win 9x 4.90; en-US) Pr...,"Fletcher, Richards and Whitaker",3337758169645356,11/18,561,Mastercard,anthony41@reed.com,Drilling engineer,15.160.41.51,fr,70.73
2,Unit 0065 Box 5052\nDPO AP 27450,94 vE,PM,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,"Simpson, Williams and Pham",675957666125,08/19,699,JCB 16 digit,amymiller@morales-harrison.com,Customer service manager,132.207.160.22,de,0.95
3,"7780 Julia Fords\nNew Stacy, WA 45798",36 vm,PM,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0 ...,"Williams, Marshall and Buchanan",6011578504430710,02/24,384,Discover,brent16@olson-robinson.info,Drilling engineer,30.250.74.19,es,78.04
4,"23012 Munoz Drive Suite 337\nNew Cynthia, TX 5...",20 IE,AM,Opera/9.58.(X11; Linux x86_64; it-IT) Presto/2...,"Brown, Watson and Andrews",6011456623207998,10/25,678,Diners Club / Carte Blanche,christopherwright@gmail.com,Fine artist,24.140.33.94,es,77.82


What are the dimensions of our data?

In [13]:
data.shape

(10000, 14)

In [14]:
variable_resultado = 3

assert(variable_resultado == 3)

*Expected answer:*

```python
(10000,14)
```

What is the average purchase price?

In [15]:
data['Purchase Price'].mean()

50.34730200000025

*Expected answer:*

```python
50.34730200000025
```

What were the highest and lowest purchase prices, respectively?

In [None]:
data['Purchase Price'].max()

99.99

*Expected answer:*

```python
99.989999999999995
```

In [None]:
max_valor = data['Purchase Price'].max()

condicion = data['Purchase Price'] == max_valor

data[condicion]

Unnamed: 0,Address,Lot,AM or PM,Browser Info,Company,Credit Card,CC Exp Date,CC Security Code,CC Provider,Email,Job,IP Address,Language,Purchase Price
2092,"63773 Shelton Greens\nAshleyton, MA 00493",56 lu,AM,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,Pitts Group,4292741269160,06/18,824,Maestro,heatherwoodard@lloyd.com,"Surveyor, hydrographic",172.197.216.229,el,99.99
7807,"PSC 6177, Box 1004\nAPO AA 57143-1269",64 Nf,AM,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_4)...,"Porter, Johnson and Pratt",30109394842259,11/25,918,VISA 16 digit,kelli72@gmail.com,"Surveyor, insurance",89.51.92.242,de,99.99


In [16]:
data['Purchase Price'].min()

0.0

*Expected answer:*

```python
0.0
```
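As an alternative sketch, both extremes can be obtained in a single call with `Series.agg`. The toy prices below are taken from the first rows displayed earlier; the variable names `prices` and `extremes` are assumptions for illustration.

```python
import pandas as pd

# Toy prices copied from the first displayed rows; not the full dataset
prices = pd.Series([98.14, 70.73, 0.95, 78.04], name="Purchase Price")

# agg with a list of function names returns one value per function
extremes = prices.agg(["max", "min"])

print(extremes)
```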

How many people have English (value 'en') set as their preferred language?

In [17]:
sum(data['Language'] == 'en')

1098

In [18]:
len(data[data['Language'] == 'en'])

1098

In [19]:
data[data["Language"] == "en"].shape

(1098, 14)

*Expected answer:*

```python
1098
```
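The three approaches above all rely on a boolean mask. The sketch below, on a hypothetical four-element `Series`, shows why they agree: `True` counts as 1 when summed, and filtering with the mask keeps only the matching rows.

```python
import pandas as pd

# Hypothetical toy column used only for illustration
languages = pd.Series(["en", "fr", "en", "de"], name="Language")

# Comparing a Series with a scalar gives a boolean Series (the mask)
is_english = languages == "en"

# True counts as 1, so summing the mask counts the matches
count_sum = int(is_english.sum())

# Filtering with the mask keeps the matching rows, which can then be counted
count_len = len(languages[is_english])

# value_counts tallies every distinct value; select the one of interest
count_vc = int(languages.value_counts()["en"])

print(count_sum, count_len, count_vc)
```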

How many people have the job title `'Lawyer'`?


In [None]:
data['Job'].unique()

In [21]:
data.columns

Index(['Address', 'Lot', 'AM or PM', 'Browser Info', 'Company', 'Credit Card',
       'CC Exp Date', 'CC Security Code', 'CC Provider', 'Email', 'Job',
       'IP Address', 'Language', 'Purchase Price'],
      dtype='object')

In [22]:
sum(data['Job'] == 'Lawyer')

30

*Expected answer:*

```python
30
```

How many people made purchases in the morning, and how many in the afternoon?

*(Hint: use [value_counts()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html))*

In [None]:
columnas_antiguas = data.columns.tolist()

columnas_normalizadas = [x.upper().strip().replace(' ','_') for x in columnas_antiguas]

# Rename on a copy: the following cells still use the original column names
data_normalizada = data.rename(columns = dict(zip(columnas_antiguas, columnas_normalizadas)))

In [None]:
data['Purchase Price al cuadrado'] = data['Purchase Price']**2

In [26]:
data['AM or PM'].value_counts()[::-1]

AM    4932
PM    5068
Name: AM or PM, dtype: int64

*Expected answer:*

```python
PM    5068
AM    4932
Name: AM or PM, dtype: int64
```
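The order of the result depends on how it is sorted. As a small sketch on hypothetical data, `value_counts()` sorts by descending count by default, while `sort_index()` orders the result by label instead.

```python
import pandas as pd

# Hypothetical toy column used only for illustration
periods = pd.Series(["PM", "AM", "PM", "PM", "AM"], name="AM or PM")

# Default: most frequent value first
by_count = periods.value_counts()

# Alternative: order by label (alphabetical) rather than by frequency
by_label = periods.value_counts().sort_index()

print(by_count)
print(by_label)
```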

What are the five most common jobs?

In [27]:
data["Job"].value_counts()[:5]

Interior and spatial designer    31
Lawyer                           30
Social researcher                28
Purchasing manager               27
Designer, jewellery              27
Name: Job, dtype: int64

In [28]:
data["Job"].value_counts().head(5)

Interior and spatial designer    31
Lawyer                           30
Social researcher                28
Purchasing manager               27
Designer, jewellery              27
Name: Job, dtype: int64

*Expected answer:*

```python
Interior and spatial designer    31
Lawyer                           30
Social researcher                28
Purchasing manager               27
Designer, jewellery              27
Name: Job, dtype: int64
```

Some user made a purchase that came from lot `90 WT`. What was the price of that transaction?

In [29]:
data[data['Lot'] == '90 WT']

Unnamed: 0,Address,Lot,AM or PM,Browser Info,Company,Credit Card,CC Exp Date,CC Security Code,CC Provider,Email,Job,IP Address,Language,Purchase Price
513,"50398 Mccoy Rest Suite 597\nSouth Garyborough,...",90 WT,AM,Mozilla/5.0 (iPod; U; CPU iPhone OS 3_2 like M...,Bright PLC,630438419693,11/19,173,American Express,jesse00@page.net,Energy engineer,156.70.208.94,ru,75.1


In [None]:
data[0:2]

In [31]:
condicion = data['Lot'] == "90 WT"


data.loc[condicion]

Unnamed: 0,Address,Lot,AM or PM,Browser Info,Company,Credit Card,CC Exp Date,CC Security Code,CC Provider,Email,Job,IP Address,Language,Purchase Price
513,"50398 Mccoy Rest Suite 597\nSouth Garyborough,...",90 WT,AM,Mozilla/5.0 (iPod; U; CPU iPhone OS 3_2 like M...,Bright PLC,630438419693,11/19,173,American Express,jesse00@page.net,Energy engineer,156.70.208.94,ru,75.1


*Expected answer:*

```python
513    75.1
Name: Purchase Price, dtype: float64
```
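The cells above return the whole matching row, while the expected answer contains only the price. One possible way to obtain just that column is to pass both a row mask and a column label to `.loc`. The three-row `DataFrame` `toy` below is an assumption built from values displayed earlier.

```python
import pandas as pd

# Toy frame built from a few displayed values; not the full dataset
toy = pd.DataFrame(
    {
        "Lot": ["46 in", "90 WT", "28 rn"],
        "Purchase Price": [98.14, 75.1, 70.73],
    }
)

# .loc[row_mask, column_label] filters rows and selects one column at once
price = toy.loc[toy["Lot"] == "90 WT", "Purchase Price"]

print(price)
```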

What is the email of the person with credit card `4926535242672853`?

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Address                     10000 non-null  object 
 1   Lot                         10000 non-null  object 
 2   AM or PM                    10000 non-null  object 
 3   Browser Info                10000 non-null  object 
 4   Company                     10000 non-null  object 
 5   Credit Card                 10000 non-null  int64  
 6   CC Exp Date                 10000 non-null  object 
 7   CC Security Code            10000 non-null  int64  
 8   CC Provider                 10000 non-null  object 
 9   Email                       10000 non-null  object 
 10  Job                         10000 non-null  object 
 11  IP Address                  10000 non-null  object 
 12  Language                    10000 non-null  object 
 13  Purchase Price              1000

In [32]:
tarjeta_credito = 4926535242672853

data[data['Credit Card'] == tarjeta_credito]['Email']

1234    bondellen@williams-garza.com
Name: Email, dtype: object

*Expected answer:*

```python
1234    bondellen@williams-garza.com
Name: Email, dtype: object
```

How many people have `American Express` as their credit card provider and made a purchase above `$95`?

In [35]:
condicion1 = data['CC Provider'] == 'American Express'

condicion2 = data['Purchase Price'] > 95 

data[condicion1 & condicion2].shape[0]

39

*Expected answer:*

```python
39
```
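When two conditions are combined on a single line, each comparison needs its own parentheses, because `&` binds more tightly than `==` and `>`. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy frame used only for illustration
toy = pd.DataFrame(
    {
        "CC Provider": ["American Express", "Mastercard", "American Express"],
        "Purchase Price": [96.5, 99.0, 40.0],
    }
)

# Parentheses are required around each comparison when using & inline
mask = (toy["CC Provider"] == "American Express") & (toy["Purchase Price"] > 95)

# Summing the boolean mask counts the rows that satisfy both conditions
count = int(mask.sum())

print(count)
```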

How many people have a credit card that expires in 2025?

In [None]:
int('02/20'.split('/')[1])

20

In [None]:
def arreglar_columna(x):
    # Keep the last two characters (the year) of 'MM/YY' strings;
    # return any non-string value unchanged
    if isinstance(x, str):
        return x[-2:]
    return x


data['CC Exp Date'].apply(arreglar_columna)

In [None]:
# Keep the two-digit year in a separate variable so the original 'MM/YY' strings stay intact
cc_exp_year = data['CC Exp Date'].str.split('/').str[1].astype(int)

In [None]:
cc_exp_year.value_counts()[25]

1033

In [36]:
data['CC Exp Date'].apply(lambda x: int(x[-2:]))

0       20
1       18
2       19
3       24
4       25
        ..
9995    22
9996    25
9997    21
9998    17
9999    19
Name: CC Exp Date, Length: 10000, dtype: int64

*Expected answer:*

```python
1033
```
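As a sketch of a vectorised alternative, the year can be extracted with the `.str` accessor and then compared against `25`. The four sample dates and the names `exp_dates`, `years`, and `count_2025` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy column in the same 'MM/YY' format as the dataset
exp_dates = pd.Series(["02/20", "11/25", "07/25", "05/21"], name="CC Exp Date")

# Split on '/', take the second piece (the year) and convert it to int
years = exp_dates.str.split("/").str[1].astype(int)

# Count the rows whose two-digit year is 25
count_2025 = int((years == 25).sum())

print(count_2025)
```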

What were the five most popular *email hosts* (e.g., gmail, yahoo, ...)?

In [None]:
# Count how many emails contain '@' (len() of the list would always equal the number of rows)
sum('@' in x for x in data['Email'])

10000

In [None]:
data['Host'] = [x[1] for x in data['Email'].str.split('@')]


In [None]:
data['Host'].value_counts().head(5)

hotmail.com     1638
yahoo.com       1616
gmail.com       1605
smith.com         42
williams.com      37
Name: Host, dtype: int64

*Expected answer:*

```python
hotmail.com     1638
yahoo.com       1616
gmail.com       1605
smith.com         42
williams.com      37
Name: Email, dtype: int64
```
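The host can also be extracted without adding a column to the `DataFrame`, which would explain why the expected answer is named `Email` rather than `Host`. The following is a sketch on hypothetical addresses, not the dataset itself.

```python
import pandas as pd

# Hypothetical toy addresses used only for illustration
emails = pd.Series(
    [
        "a@gmail.com",
        "b@yahoo.com",
        "c@gmail.com",
        "d@reed.com",
        "e@gmail.com",
        "f@yahoo.com",
    ],
    name="Email",
)

# Split each address on '@' and keep the part after it (the host)
hosts = emails.str.split("@").str[1]

# Count the occurrences of each host and keep the two most frequent
top_hosts = hosts.value_counts().head(2)

print(top_hosts)
```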

In [37]:
data

Unnamed: 0,Address,Lot,AM or PM,Browser Info,Company,Credit Card,CC Exp Date,CC Security Code,CC Provider,Email,Job,IP Address,Language,Purchase Price
0,"16629 Pace Camp Apt. 448\nAlexisborough, NE 77...",46 in,PM,Opera/9.56.(X11; Linux x86_64; sl-SI) Presto/2...,Martinez-Herman,6011929061123406,02/20,900,JCB 16 digit,pdunlap@yahoo.com,"Scientist, product/process development",149.146.147.205,el,98.14
1,"9374 Jasmine Spurs Suite 508\nSouth John, TN 8...",28 rn,PM,Opera/8.93.(Windows 98; Win 9x 4.90; en-US) Pr...,"Fletcher, Richards and Whitaker",3337758169645356,11/18,561,Mastercard,anthony41@reed.com,Drilling engineer,15.160.41.51,fr,70.73
2,Unit 0065 Box 5052\nDPO AP 27450,94 vE,PM,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,"Simpson, Williams and Pham",675957666125,08/19,699,JCB 16 digit,amymiller@morales-harrison.com,Customer service manager,132.207.160.22,de,0.95
3,"7780 Julia Fords\nNew Stacy, WA 45798",36 vm,PM,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0 ...,"Williams, Marshall and Buchanan",6011578504430710,02/24,384,Discover,brent16@olson-robinson.info,Drilling engineer,30.250.74.19,es,78.04
4,"23012 Munoz Drive Suite 337\nNew Cynthia, TX 5...",20 IE,AM,Opera/9.58.(X11; Linux x86_64; it-IT) Presto/2...,"Brown, Watson and Andrews",6011456623207998,10/25,678,Diners Club / Carte Blanche,christopherwright@gmail.com,Fine artist,24.140.33.94,es,77.82
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,"966 Castaneda Locks\nWest Juliafurt, CO 96415",92 XI,PM,Mozilla/5.0 (Windows NT 5.1) AppleWebKit/5352 ...,Randall-Sloan,342945015358701,03/22,838,JCB 15 digit,iscott@wade-garner.com,Printmaker,29.73.197.114,it,82.21
9996,"832 Curtis Dam Suite 785\nNorth Edwardburgh, T...",41 JY,AM,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,"Hale, Collins and Wilson",210033169205009,07/25,207,JCB 16 digit,mary85@hotmail.com,Energy engineer,121.133.168.51,pt,25.63
9997,Unit 4434 Box 6343\nDPO AE 28026-0283,74 Zh,AM,Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_7...,Anderson Ltd,6011539787356311,05/21,1,VISA 16 digit,tyler16@gmail.com,Veterinary surgeon,156.210.0.254,el,83.98
9998,"0096 English Rest\nRoystad, IA 12457",74 cL,PM,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_8;...,Cook Inc,180003348082930,11/17,987,American Express,elizabethmoore@reid.net,Local government officer,55.78.26.143,es,38.84
