In data analysis, the quality of the data matters as much as the analysis itself. Raw data often contains errors, missing values, or inconsistencies that can reduce the accuracy of the results.

* Reading data: using pd.read_csv() to import data from a CSV file.
* Converting data types: converting columns, such as dates, to more suitable data types.
* Dropping rows or columns: removing rows or columns that are not needed.
* Imputing missing data: filling missing values with the mean, the median, or a constant value.
* Removing duplicates: preserving data integrity by dropping duplicate rows.
* Filtering data: focusing the analysis on specific subsets of the data.
* Creating new columns: generating derived columns to make the analysis easier.
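
The steps listed above can be sketched end to end on a tiny table. This is only an illustrative sketch: the CSV text, the column names (`order_id`, `order_date`, `quantity`, `unit_price`, `notes`) and the threshold used for filtering are made up for the example and are not part of the retail dataset used in this notebook.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """order_id,order_date,quantity,unit_price,notes
1,2024-01-05,2,10.0,web
2,2024-01-06,,4.5,store
3,2024-01-07,5,2.0,web
3,2024-01-07,5,2.0,web
4,2024-01-08,8,1.5,store
"""

# Reading data
orders = pd.read_csv(io.StringIO(csv_text))

# Converting data types: parse the date strings into datetimes
orders["order_date"] = pd.to_datetime(orders["order_date"])

# Dropping columns that are not needed
orders = orders.drop(columns=["notes"])

# Removing duplicate rows
orders = orders.drop_duplicates()

# Imputing missing data: fill the missing quantity with the median
orders["quantity"] = orders["quantity"].fillna(orders["quantity"].median())

# Filtering data: keep only the larger orders (threshold chosen arbitrarily)
large_orders = orders[orders["quantity"] >= 5].copy()

# Creating a new column derived from existing ones
large_orders["revenue"] = large_orders["quantity"] * large_orders["unit_price"]

print(large_orders)
```

The order of the steps is a design choice for this example; in practice it depends on the dataset (for instance, duplicates are often removed before imputing so that repeated rows do not skew the median).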

# Creating DataFrames

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

path = "/content/online_retail.csv"
retail_data = pd.read_csv(path)

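# For an Excel file: pd.read_excel(path)
# For a JSON file: pd.read_json(path)

#print(type(retail_data))
#print(retail_data.shape)
# Note: without parentheses, `info` is not called, so this prints the bound
# method (which embeds the DataFrame repr). Use retail_data.info() for the column summary.
print(retail_data.info)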



<bound method DataFrame.info of        InvoiceNo StockCode                          Description  Quantity  \
0         536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1         536365     71053                  WHITE METAL LANTERN         6   
2         536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3         536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4         536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
...          ...       ...                                  ...       ...   
541904    581587     22613          PACK OF 20 SPACEBOY NAPKINS        12   
541905    581587     22899         CHILDREN'S APRON DOLLY GIRL          6   
541906    581587     23254        CHILDRENS CUTLERY DOLLY GIRL          4   
541907    581587     23255      CHILDRENS CUTLERY CIRCUS PARADE         4   
541908    581587     22138        BAKING SET 9 PIECE RETROSPOT          3   

                InvoiceDate  UnitPrice  Cus

In [9]:
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

dt_from_array = pd.DataFrame(data, columns = ["A", "B", "C"])
print(dt_from_array)

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9


In [10]:
data =[[1, "Jhon", 22], [2, "Ana", 24]]

dt_from_list = pd.DataFrame(data, columns = ["ID", "Name", "Age"])
print(dt_from_list)

   ID  Name  Age
0   1  Jhon   22
1   2   Ana   24


In [11]:
data = [{"ID": 1,
         "Name": "Jhon",
         "Age": 22}]
dt_from_dict_list = pd.DataFrame(data)
print(dt_from_dict_list)

   ID  Name  Age
0   1  Jhon   22


In [12]:
data = {"ID": [1,2,3], "Name": ["Jhon", "Ana", "Peter"], "Age": [22, 24, 26]}
dt_from_dict = pd.DataFrame(data)
print(dt_from_dict)

   ID   Name  Age
0   1   Jhon   22
1   2    Ana   24
2   3  Peter   26


In [14]:
# We can also build a DataFrame from a dictionary of Series
data = {"ID": pd.Series([1,2,3]),
        "Name": pd.Series(["Jhon", "Ana", "Julia"]),
        "Age": pd.Series([22, 24, 26])}
dt_from_series_dict = pd.DataFrame(data)
print(dt_from_series_dict)

   ID   Name  Age
0   1   Jhon   22
1   2    Ana   24
2   3  Julia   26




---



If the file is in Google Drive, you can mount the drive with the following code, or do it directly from the icon in the left sidebar.

In [None]:
from google.colab import drive

drive.mount('/content/drive')
file_path = "/content/drive/MyDrive/Colab Notebooks/online_retail.csv"

sales_data = pd.read_csv(file_path)
print(sales_data.head())

* Series: one-dimensional, similar to a single column.
* DataFrame: two-dimensional, a table.
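
A minimal sketch of the difference between the two structures, using made-up values rather than the retail dataset:

```python
import pandas as pd

# A Series is one-dimensional: a single labelled column of values
ages = pd.Series([22, 24, 26], name="Age")

# A DataFrame is two-dimensional: a table made of several columns
people = pd.DataFrame({"Name": ["Jhon", "Ana", "Peter"], "Age": [22, 24, 26]})

print(ages.ndim, ages.shape)      # 1 (3,)
print(people.ndim, people.shape)  # 2 (3, 2)

# Selecting a single column of a DataFrame returns a Series
age_column = people["Age"]
print(type(age_column).__name__)  # Series
```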

---
Structures and functions

In [15]:
# Column names and shape (rows, columns)
column_names = retail_data.columns
num_rows, num_cols = retail_data.shape

print(num_rows, num_cols)
print(column_names)

541909 8
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')


In [19]:
daily_sales = retail_data["Quantity"]
print(daily_sales)
print(daily_sales[4])

0          6
1          6
2          8
3          6
4          6
          ..
541904    12
541905     6
541906     4
541907     4
541908     3
Name: Quantity, Length: 541909, dtype: int64
6


In [20]:
summary = retail_data.describe()
print(summary)

            Quantity      UnitPrice     CustomerID
count  541909.000000  541909.000000  406829.000000
mean        9.552250       4.611114   15287.690570
std       218.081158      96.759853    1713.600303
min    -80995.000000  -11062.060000   12346.000000
25%         1.000000       1.250000   13953.000000
50%         3.000000       2.080000   15152.000000
75%        10.000000       4.130000   16791.000000
max     80995.000000   38970.000000   18287.000000


In [26]:
mean_value = daily_sales.mean()
median_value = daily_sales.median()  # kept separate so it does not overwrite the mean
std = daily_sales.std()
total_sum = daily_sales.sum()
max_value = daily_sales.max()
min_value = daily_sales.min()

print(f"Mean: {mean_value:.6f}")
print(f"Median: {median_value}")
print(f"Standard deviation: {std}")
print(f"Total sum: {total_sum}")
print(f"Maximum value: {max_value}")
print(f"Minimum value: {min_value}")


Mean: 9.552250
Median: 3.0
Standard deviation: 218.08115784986612
Total sum: 5176450
Maximum value: 80995
Minimum value: -80995


The sum alone does not tell us whether some cells are null or missing, because pandas skips null values when summing by default. To check, we count the non-null values.

In [27]:
count_values = daily_sales.count()  # Count of non-null values
print(count_values)

541909


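Comparing the total sum with the count gives different numbers, as expected: the sum adds up the values, while the count is the number of non-null entries. Here the count (541909) equals the number of rows in the DataFrame, so `Quantity` has no missing values.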
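
To see how `count()` behaves when values really are missing, here is a small sketch with a made-up Series (the values are hypothetical and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical quantities with two missing entries
quantities = pd.Series([6, np.nan, 8, np.nan, 3])

total_entries = len(quantities)    # all entries, including missing ones
non_null = quantities.count()      # only the non-null entries
missing = quantities.isna().sum()  # number of missing entries
total = quantities.sum()           # NaN values are skipped by default

print(total_entries, non_null, missing, total)  # 5 3 2 17.0
```

When `count()` is smaller than `len()`, the column has missing values; when both match, as with `Quantity` above, it has none.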

---
 Get methods
Extract the values or rows that we specify; for example, `head()` returns the first five rows by default.

In [32]:
retail_data.head()

|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


## Data selection

* **iloc** (integer-based indexing): selects rows and columns by their position.
* **loc** (label-based indexing): selects rows and columns by their names.
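
With the default integer index, positions and labels coincide, which hides the difference between the two accessors. A minimal sketch with a made-up table and a non-default index (both hypothetical) makes it visible:

```python
import pandas as pd

# Hypothetical table whose index labels differ from the row positions
items = pd.DataFrame(
    {"Description": ["lantern", "hanger", "bottle", "napkins"],
     "Quantity": [6, 8, 6, 12]},
    index=[10, 20, 30, 40],
)

by_position = items.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = items.loc[10:30]    # labels 10 through 30; the end label is included

value_by_label = items.loc[20, "Quantity"]  # row labelled 20, column "Quantity"
value_by_position = items.iloc[1, 1]        # second row, second column

print(len(by_position), len(by_label))      # 2 3
print(value_by_label, value_by_position)    # 8 8
```

Note that slices with `loc` include the end label, while slices with `iloc` exclude the end position, as in regular Python slicing.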

In [33]:
first_row = retail_data.iloc[0]
print(first_row)

InvoiceNo                                  536365
StockCode                                  85123A
Description    WHITE HANGING HEART T-LIGHT HOLDER
Quantity                                        6
InvoiceDate                   2010-12-01 08:26:00
UnitPrice                                    2.55
CustomerID                                17850.0
Country                            United Kingdom
Name: 0, dtype: object


In [34]:
first_five_row = retail_data.iloc[:5]
print(first_five_row)

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
0  2010-12-01 08:26:00       2.55     17850.0  United Kingdom  
1  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
2  2010-12-01 08:26:00       2.75     17850.0  United Kingdom  
3  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
4  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  


In [35]:
subset = retail_data.iloc[:2 , :3] # [rows, columns]
print(subset)

  InvoiceNo StockCode                         Description
0    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER
1    536365     71053                 WHITE METAL LANTERN


In [37]:
retail_value = retail_data.iloc[1, 2]
print(retail_value)

WHITE METAL LANTERN


In [38]:
row_index_3 = retail_data.loc[3]
print(row_index_3)

InvoiceNo                                   536365
StockCode                                   84029G
Description    KNITTED UNION FLAG HOT WATER BOTTLE
Quantity                                         6
InvoiceDate                    2010-12-01 08:26:00
UnitPrice                                     3.39
CustomerID                                 17850.0
Country                             United Kingdom
Name: 3, dtype: object


In [39]:
row_index_0_to_4 = retail_data.loc[0:4]
print(row_index_0_to_4)

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
0  2010-12-01 08:26:00       2.55     17850.0  United Kingdom  
1  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
2  2010-12-01 08:26:00       2.75     17850.0  United Kingdom  
3  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
4  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  


In [40]:
quantity_column = retail_data.loc[:, "Quantity"]
print(quantity_column)

0          6
1          6
2          8
3          6
4          6
          ..
541904    12
541905     6
541906     4
541907     4
541908     3
Name: Quantity, Length: 541909, dtype: int64


In [41]:
quantity_unitprices_column = retail_data.loc[:,["Quantity", "UnitPrice"]]
print(quantity_unitprices_column)

        Quantity  UnitPrice
0              6       2.55
1              6       3.39
2              8       2.75
3              6       3.39
4              6       3.39
...          ...        ...
541904        12       0.85
541905         6       2.10
541906         4       4.15
541907         4       4.15
541908         3       4.95

[541909 rows x 2 columns]


# Handling missing data


In [45]:
missing_data = retail_data.isna()
print(missing_data.head())

   InvoiceNo  StockCode  Description  Quantity  InvoiceDate  UnitPrice  \
0      False      False        False     False        False      False   
1      False      False        False     False        False      False   
2      False      False        False     False        False      False   
3      False      False        False     False        False      False   
4      False      False        False     False        False      False   

   CustomerID  Country  
0       False    False  
1       False    False  
2       False    False  
3       False    False  
4       False    False  


In [46]:
missing_data_count = retail_data.isna().sum()
print('Missing value count: ', missing_data_count)

Missing value count:  InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64


Once we have confirmed that some values are missing, we have two options:

a. Drop the rows or columns that contain missing values.

b. Fill the missing values with a substitute value.
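
Both options can also be applied more selectively than on the whole table. The following is a sketch on a small made-up DataFrame (the values and the `"UNKNOWN"` placeholder are assumptions for illustration): `dropna(subset=...)` drops rows only when a given column is missing, and `fillna` accepts a dictionary with a different fill value per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in different columns
orders = pd.DataFrame({
    "Description": ["lantern", None, "bottle", "napkins"],
    "UnitPrice": [3.39, 2.10, np.nan, 0.85],
    "CustomerID": [17850.0, np.nan, 13047.0, np.nan],
})

# Option a: drop only the rows where CustomerID is missing
with_customer = orders.dropna(subset=["CustomerID"])

# Option b: fill each column with its own substitute value
filled = orders.fillna({
    "Description": "UNKNOWN",                   # constant placeholder
    "UnitPrice": orders["UnitPrice"].median(),  # median of the known prices
})

print(with_customer)
print(filled)
```

Columns not listed in the dictionary (here `CustomerID`) keep their missing values, which can be preferable to inventing an identifier.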

In [48]:
# Drop the rows with missing data
no_missing_data = retail_data.dropna()
print('Data without rows containing missing values', no_missing_data.head())

Data without rows containing missing values   InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

           InvoiceDate  UnitPrice  CustomerID         Country  
0  2010-12-01 08:26:00       2.55     17850.0  United Kingdom  
1  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
2  2010-12-01 08:26:00       2.75     17850.0  United Kingdom  
3  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
4  2010-12-01 08:26:00       3.39     17850.0  United Kingdom  


In [49]:
# Drop the columns with missing data
no_missing_columns = retail_data.dropna(axis = 1)
print('Data without columns containing missing values', no_missing_columns)

Data without columns containing missing values        InvoiceNo StockCode  Quantity          InvoiceDate  UnitPrice  \
0         536365    85123A         6  2010-12-01 08:26:00       2.55   
1         536365     71053         6  2010-12-01 08:26:00       3.39   
2         536365    84406B         8  2010-12-01 08:26:00       2.75   
3         536365    84029G         6  2010-12-01 08:26:00       3.39   
4         536365    84029E         6  2010-12-01 08:26:00       3.39   
...          ...       ...       ...                  ...        ...   
541904    581587     22613        12  2011-12-09 12:50:00       0.85   
541905    581587     22899         6  2011-12-09 12:50:00       2.10   
541906    581587     23254         4  2011-12-09 12:50:00       4.15   
541907    581587     23255         4  2011-12-09 12:50:00       4.15   
541908    581587     22138         3  2011-12-09 12:50:00       4.95   

               Country  
0       United Kingdom  
1       United Kingdom  
2       United King

In [52]:
# Fill the missing values with zero
retail_data_filled_zeros = retail_data.fillna(0)
missing_data_filled_zeros_count = retail_data_filled_zeros.isna().sum()
print(retail_data_filled_zeros)
print(missing_data_filled_zeros_count)

       InvoiceNo StockCode                          Description  Quantity  \
0         536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1         536365     71053                  WHITE METAL LANTERN         6   
2         536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3         536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4         536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
...          ...       ...                                  ...       ...   
541904    581587     22613          PACK OF 20 SPACEBOY NAPKINS        12   
541905    581587     22899         CHILDREN'S APRON DOLLY GIRL          6   
541906    581587     23254        CHILDRENS CUTLERY DOLLY GIRL          4   
541907    581587     23255      CHILDRENS CUTLERY CIRCUS PARADE         4   
541908    581587     22138        BAKING SET 9 PIECE RETROSPOT          3   

                InvoiceDate  UnitPrice  CustomerID         Country  
0     

In [56]:
missing_data_count =  retail_data['UnitPrice'].isna().sum()
print('Missing value count: ', missing_data_count)

Missing value count:  0


In [57]:
# UnitPrice has no missing values (see the count above), so this fill leaves the column unchanged
mean_unit_price = retail_data['UnitPrice'].mean()
unit_price_filled_mean = retail_data['UnitPrice'].fillna(mean_unit_price)
print(unit_price_filled_mean)

0         2.55
1         3.39
2         2.75
3         3.39
4         3.39
          ... 
541904    0.85
541905    2.10
541906    4.15
541907    4.15
541908    4.95
Name: UnitPrice, Length: 541909, dtype: float64


In [None]:
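def total_revenue(group):
  # Total revenue of a group: sum of Quantity * UnitPrice over its rows
  return (group["Quantity"] * group["UnitPrice"]).sum()

# One total per country, computed on the retail_data DataFrame loaded earlier
revenue_by_country = retail_data.groupby("Country").apply(total_revenue)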

x =