### Analysis with Pandas and Kaggle (Core)

The goal of this activity is to put into practice everything learned about Pandas through the complete analysis of a dataset. Students must apply data loading, exploration, cleaning, transformation, and aggregation techniques to extract insights. The activity does not include data visualization; it focuses solely on analyzing and manipulating data with Pandas.

### Loading the Data

Load the CSV file into a Pandas DataFrame.

In [2]:
import pandas as pd
df = pd.read_csv('../data/forest_health_data.csv')

Display the first 10 rows of the DataFrame to confirm that the data loaded correctly.

In [3]:
df.head(10)

Unnamed: 0,Plot_ID,Latitude,Longitude,DBH,Tree_Height,Crown_Width_North_South,Crown_Width_East_West,Slope,Elevation,Soil_TN,Soil_TP,Soil_AP,Soil_AN,Menhinick_Index,Gleason_Index,Disturbance_Level,Fire_Risk_Index,Health_Status
0,1,24.981605,-117.040695,29.86204,20.835684,6.147963,4.54272,29.171563,212.518419,0.723065,0.457221,0.189952,0.26885,2.135766,4.897271,0.073175,0.49967,Healthy
1,2,48.028572,-92.066934,28.462986,24.307079,8.248891,5.260921,7.757386,641.640332,0.69041,0.265053,0.169791,0.07326,0.700081,1.068692,0.089478,0.746747,Unhealthy
2,3,39.279758,-68.893791,91.094185,9.013101,7.841448,8.690927,39.257755,2510.612835,0.104797,0.363831,0.092196,0.297665,1.105825,4.790607,0.651974,0.562667,Unhealthy
3,4,33.946339,-78.744258,28.706889,19.496475,2.385099,4.060039,27.590231,2323.628233,0.923347,0.220844,0.305597,0.160819,2.434198,2.47471,0.486941,0.083303,Sub-Healthy
4,5,16.240746,-73.54072,30.835224,18.008888,2.343245,8.826847,7.074175,1116.863805,0.572787,0.316867,0.240929,0.030913,1.821715,1.040362,0.790415,0.18558,Unhealthy
5,6,16.239781,-83.885164,77.142835,25.319251,3.413569,1.79321,43.305213,1192.750821,0.370088,0.173585,0.433522,0.474585,2.819923,3.756629,0.916271,0.219328,Unhealthy
6,7,12.323344,-81.54064,47.725285,27.370438,4.249673,7.991186,23.326446,1647.307857,0.758973,0.154074,0.020894,0.322538,1.571879,4.922161,0.388367,0.257864,Healthy
7,8,44.647046,-70.556304,78.787503,2.34039,4.6761,8.627929,3.28043,100.698914,0.264792,0.291224,0.323715,0.439075,2.674531,4.862503,0.59641,0.565403,Sub-Healthy
8,9,34.0446,-112.523239,11.209785,20.872558,7.117275,2.636359,28.207481,799.608575,0.696575,0.251574,0.38266,0.339648,1.622389,1.643748,0.500697,0.503966,Sub-Healthy
9,10,38.322903,-95.740253,51.319263,3.451402,1.510124,4.873119,11.393951,703.872245,0.049314,0.291871,0.380946,0.181707,2.948121,1.724991,0.981611,0.20316,Unhealthy


Dataset Description
The dataset is a comprehensive collection of ecological and environmental measurements focused on tree characteristics and site conditions. Each record in the dataset represents a distinct tree or plot, with the following features:

* Plot_ID: A unique identifier for each plot where measurements are taken. This helps in distinguishing between different locations within the study area.

* Latitude: The geographical latitude of the plot, measured in degrees. This indicates the north-south position of the plot on the Earth's surface.

* Longitude: The geographical longitude of the plot, measured in degrees. This indicates the east-west position of the plot.

* DBH (Diameter at Breast Height): The diameter of the tree measured at 1.3 meters (or breast height) above ground level, typically expressed in centimeters. This metric is crucial for assessing tree size and health.

* Tree_Height: The total height of the tree from the base to the top, measured in meters. This measurement helps in understanding the growth patterns and ecological role of the tree.

* Crown_Width_North_South: The width of the tree's crown measured in the north-south direction, typically in meters. This dimension can indicate the tree's overall health and competitive status in the ecosystem.

* Crown_Width_East_West: The width of the tree's crown measured in the east-west direction, also typically in meters. Together with crown width in the north-south direction, it provides a complete view of the tree's canopy size.

* Slope: The steepness of the terrain where the tree is located, measured in degrees. This can influence water drainage, soil erosion, and root development.

* Elevation: The height of the plot above sea level, measured in meters. Elevation can affect temperature, precipitation, and overall ecosystem dynamics.

* Temperature: The average temperature recorded at the plot, measured in degrees Celsius. This factor can influence tree growth, health, and species distribution. This column is not present in the file loaded above.

* Humidity: The average humidity at the plot, expressed as a percentage. Humidity levels can affect transpiration rates and overall tree health. This column is not present in the file loaded above.

* Soil_TN (Total Nitrogen): The concentration of total nitrogen in the soil, measured in grams per kilogram (g/kg). Nitrogen is essential for plant growth and development.

* Soil_TP (Total Phosphorus): The concentration of total phosphorus in the soil, also measured in grams per kilogram (g/kg). Phosphorus is crucial for energy transfer and photosynthesis.

* Soil_AP (Available Phosphorus): The amount of phosphorus readily available to plants in the soil, measured in grams per kilogram (g/kg). This metric helps assess nutrient availability.

* Soil_AN (Available Nitrogen): The amount of nitrogen available for plant uptake in the soil, measured in grams per kilogram (g/kg). This reflects soil fertility.

* Menhinick_Index: A diversity index that reflects species richness in the area. Higher values indicate greater biodiversity.

* Gleason_Index: Another diversity index that accounts for the abundance and richness of species within the community.

* Disturbance_Level: The level of ecological disturbance in the area. The dataset description presents it as categorical (0: low, 1: medium, 2: high), but in the loaded file it holds continuous values between 0 and 1. This can impact the health and stability of the ecosystem.

* Fire_Risk_Index: A measure of the likelihood of fire occurrence based on environmental conditions, scored between 0 and 1. This can inform management strategies for fire-prone areas.

* Health_Status: A categorical variable indicating the health of the tree. In the loaded file it takes the values 'Healthy', 'Sub-Healthy', and 'Unhealthy'. This is important for understanding the impact of environmental factors on tree vitality.
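
A description like this can drift from the file actually loaded, so it may be worth comparing the documented column names against the DataFrame's columns. The sketch below is only an illustration: `documented` is copied from the list above, and `sample_df` is a hypothetical stand-in for `df`, built as an empty frame with the 18 column names visible in the `df.head(10)` output.

```python
import pandas as pd

# Column names as listed in the dataset description
documented = [
    'Plot_ID', 'Latitude', 'Longitude', 'DBH', 'Tree_Height',
    'Crown_Width_North_South', 'Crown_Width_East_West', 'Slope', 'Elevation',
    'Temperature', 'Humidity', 'Soil_TN', 'Soil_TP', 'Soil_AP', 'Soil_AN',
    'Menhinick_Index', 'Gleason_Index', 'Disturbance_Level',
    'Fire_Risk_Index', 'Health_Status',
]

# Hypothetical stand-in for df: the 18 columns seen in the head() output
loaded = [c for c in documented if c not in ('Temperature', 'Humidity')]
sample_df = pd.DataFrame(columns=loaded)

# Columns described but absent from the data, and columns present but undescribed
missing_from_data = [c for c in documented if c not in sample_df.columns]
undocumented = [c for c in sample_df.columns if c not in documented]

print(missing_from_data)  # ['Temperature', 'Humidity']
print(undocumented)       # []
```

In the notebook itself, replacing `sample_df` with `df` would run the same check against the real file.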

### Initial Data Exploration
Display the last 5 rows of the DataFrame.

In [4]:
df.tail(5)

Unnamed: 0,Plot_ID,Latitude,Longitude,DBH,Tree_Height,Crown_Width_North_South,Crown_Width_East_West,Slope,Elevation,Soil_TN,Soil_TP,Soil_AP,Soil_AN,Menhinick_Index,Gleason_Index,Disturbance_Level,Fire_Risk_Index,Health_Status
995,996,13.663283,-84.013139,87.203097,14.378997,9.076576,7.159918,26.08817,892.162899,0.51978,0.42954,0.182925,0.007299,2.850585,2.697353,0.650883,0.007591,Unhealthy
996,997,46.692543,-63.036977,19.940955,11.363233,2.074429,5.528984,30.016659,707.605751,0.173717,0.449267,0.46372,0.437466,2.232273,4.662765,0.177798,0.572671,Sub-Healthy
997,998,15.472745,-125.172939,34.429847,13.048025,3.950586,7.88634,41.02096,1420.453374,0.977936,0.47362,0.350835,0.157126,2.669047,2.991752,0.05124,0.717187,Unhealthy
998,999,48.009494,-126.00617,32.554326,16.838336,8.341708,5.367616,15.552908,2734.468889,0.116845,0.201757,0.132303,0.469601,1.416142,2.595346,0.682962,0.74508,Unhealthy
999,1000,27.840231,-110.246905,87.784333,6.518286,6.375811,2.344435,27.967829,402.992919,0.932625,0.112484,0.284222,0.161888,0.608034,4.001556,0.251079,0.904977,Unhealthy


Use the info() method to get general information about the DataFrame, including the number of entries, column names, data types, and memory usage.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Plot_ID                  1000 non-null   int64  
 1   Latitude                 1000 non-null   float64
 2   Longitude                1000 non-null   float64
 3   DBH                      1000 non-null   float64
 4   Tree_Height              1000 non-null   float64
 5   Crown_Width_North_South  1000 non-null   float64
 6   Crown_Width_East_West    1000 non-null   float64
 7   Slope                    1000 non-null   float64
 8   Elevation                1000 non-null   float64
 9   Soil_TN                  1000 non-null   float64
 10  Soil_TP                  1000 non-null   float64
 11  Soil_AP                  1000 non-null   float64
 12  Soil_AN                  1000 non-null   float64
 13  Menhinick_Index          1000 non-null   float64
 14  Gleason_Index            

Generate descriptive statistics for the DataFrame using the describe() method.

In [6]:
df.describe()

Unnamed: 0,Plot_ID,Latitude,Longitude,DBH,Tree_Height,Crown_Width_North_South,Crown_Width_East_West,Slope,Elevation,Soil_TN,Soil_TP,Soil_AP,Soil_AN,Menhinick_Index,Gleason_Index,Disturbance_Level,Fire_Risk_Index
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,500.5,29.610262,-94.508789,52.728544,15.730501,5.446948,5.48618,22.198898,1498.874791,0.491632,0.250914,0.255317,0.2551,1.743536,2.974499,0.504893,0.490991
std,288.819436,11.685494,20.453293,27.614049,8.021702,2.581289,2.602753,13.038014,826.251755,0.279412,0.139666,0.141794,0.146605,0.71955,1.175643,0.289751,0.290822
min,1.0,10.185281,-129.774722,5.001105,2.018295,1.000276,1.055654,0.064275,100.698914,0.010241,0.005026,0.005467,0.005078,0.503009,1.005335,0.00132,0.00031
25%,250.75,19.438931,-113.124801,29.828343,8.773222,3.204766,3.24442,10.809975,784.368948,0.25689,0.135751,0.130052,0.130452,1.119647,1.939332,0.254679,0.236863
50%,500.5,29.872295,-93.688627,52.558322,15.55982,5.451383,5.413625,21.808936,1503.573023,0.483914,0.250117,0.255651,0.249754,1.7246,2.929722,0.500965,0.492343
75%,750.25,39.772784,-76.767446,77.114835,22.651143,7.659941,7.658666,34.040896,2171.952127,0.718746,0.369428,0.379568,0.387961,2.383008,4.008349,0.768492,0.746786
max,1000.0,49.988707,-60.041039,99.792981,29.987616,9.979745,9.994153,44.975731,2996.823629,0.996053,0.499755,0.499838,0.499671,2.996746,4.995379,0.999805,0.999925
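
By default, describe() summarizes only numeric columns, which is why Health_Status is absent from the table above. The following sketch uses a toy frame whose values are rounded from the first five rows shown earlier; it illustrates how `include='all'` or `value_counts()` could cover the non-numeric column as well. The name `toy` is hypothetical.

```python
import pandas as pd

# Toy data: DBH and Health_Status of the first five rows, rounded
toy = pd.DataFrame({
    'DBH': [29.9, 28.5, 91.1, 28.7, 30.8],
    'Health_Status': ['Healthy', 'Unhealthy', 'Unhealthy', 'Sub-Healthy', 'Unhealthy'],
})

# Default describe(): numeric columns only
numeric_summary = toy.describe()

# include='all' adds count/unique/top/freq for non-numeric columns
full_summary = toy.describe(include='all')

# Frequency of each class
class_counts = toy['Health_Status'].value_counts()
print(class_counts)
```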


### Data Cleaning
Identify and handle missing data using appropriate techniques (filling with statistical values, interpolation, deletion, etc.).

In [7]:
# Count missing values per column once and reuse the result
qna = df.isnull().sum()
# Non-missing count and missing percentage per column
qsna = df.shape[0] - qna
ppna = (100 * qna / df.shape[0]).round(2)
aux = {'datos sin NAs en q': qsna, 'Na en q': qna, 'Na en %': ppna}
na = pd.DataFrame(data=aux)
na.sort_values(by='Na en %', ascending=False)

Unnamed: 0,datos sin NAs en q,Na en q,Na en %
Plot_ID,1000,0,0.0
Latitude,1000,0,0.0
Fire_Risk_Index,1000,0,0.0
Disturbance_Level,1000,0,0.0
Gleason_Index,1000,0,0.0
Menhinick_Index,1000,0,0.0
Soil_AN,1000,0,0.0
Soil_AP,1000,0,0.0
Soil_TP,1000,0,0.0
Soil_TN,1000,0,0.0
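
The table is sorted in descending order and its top row is already 0.0, so this dataset has no missing values and none of the handling techniques get exercised. For reference, here is a minimal sketch on a toy Series with gaps (the values are made up) showing the three options the task mentions: statistical fill, interpolation, and deletion.

```python
import numpy as np
import pandas as pd

# Toy series with two gaps; values are made up for illustration
s = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0])

filled_median = s.fillna(s.median())  # fill with a statistic (median of 1, 3, 8 is 3)
interpolated = s.interpolate()        # linear interpolation between neighbors
dropped = s.dropna()                  # remove entries with missing values

print(filled_median.tolist())  # [1.0, 3.0, 3.0, 3.0, 8.0]
print(interpolated.tolist())   # [1.0, 2.0, 3.0, 5.5, 8.0]
print(dropped.tolist())        # [1.0, 3.0, 8.0]
```

Which option is appropriate depends on the column: interpolation assumes the row order is meaningful, which would not obviously hold for independent plots.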


Correct the data types if necessary (for example, converting strings to dates).

In [8]:
df['Health_Status'] = df.Health_Status.astype('category')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   Plot_ID                  1000 non-null   int64   
 1   Latitude                 1000 non-null   float64 
 2   Longitude                1000 non-null   float64 
 3   DBH                      1000 non-null   float64 
 4   Tree_Height              1000 non-null   float64 
 5   Crown_Width_North_South  1000 non-null   float64 
 6   Crown_Width_East_West    1000 non-null   float64 
 7   Slope                    1000 non-null   float64 
 8   Elevation                1000 non-null   float64 
 9   Soil_TN                  1000 non-null   float64 
 10  Soil_TP                  1000 non-null   float64 
 11  Soil_AP                  1000 non-null   float64 
 12  Soil_AN                  1000 non-null   float64 
 13  Menhinick_Index          1000 non-null   float64 
 14  Gleason_I
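
If the health classes are meant to be ranked, an ordered categorical could make comparisons and sorting meaningful. The sketch below assumes the order Unhealthy < Sub-Healthy < Healthy; that ordering is an assumption, not something the dataset description states, and `health_dtype` is a hypothetical name.

```python
import pandas as pd

# Assumed ordering of the three classes seen in the data (worst to best)
health_dtype = pd.CategoricalDtype(
    categories=['Unhealthy', 'Sub-Healthy', 'Healthy'], ordered=True
)

status = pd.Series(['Healthy', 'Unhealthy', 'Sub-Healthy']).astype(health_dtype)

# Ordered categoricals support comparisons and min/max
at_least_sub_healthy = status >= 'Sub-Healthy'
print(at_least_sub_healthy.tolist())  # [True, False, True]
print(status.min())                   # Unhealthy
```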

Remove duplicates, if any.

In [14]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   Plot_ID                  1000 non-null   int64   
 1   Latitude                 1000 non-null   float64 
 2   Longitude                1000 non-null   float64 
 3   DBH                      1000 non-null   float64 
 4   Tree_Height              1000 non-null   float64 
 5   Crown_Width_North_South  1000 non-null   float64 
 6   Crown_Width_East_West    1000 non-null   float64 
 7   Slope                    1000 non-null   float64 
 8   Elevation                1000 non-null   float64 
 9   Soil_TN                  1000 non-null   float64 
 10  Soil_TP                  1000 non-null   float64 
 11  Soil_AP                  1000 non-null   float64 
 12  Soil_AN                  1000 non-null   float64 
 13  Menhinick_Index          1000 non-null   float64 
 14  Gleason_I
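
drop_duplicates() removes rows without reporting how many, so counting them first makes the effect visible. This small sketch on toy data (made-up values) uses `duplicated()` for full-row duplicates and its `subset` parameter for key-based duplicates such as a repeated Plot_ID.

```python
import pandas as pd

# Toy data with one repeated row; values are made up
toy = pd.DataFrame({
    'Plot_ID': [1, 2, 2, 3],
    'DBH': [29.9, 28.5, 28.5, 91.1],
})

n_full_dupes = toy.duplicated().sum()                  # rows identical in every column
n_id_dupes = toy.duplicated(subset=['Plot_ID']).sum()  # repeated identifiers

deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_full_dupes, n_id_dupes, len(deduped))  # 1 1 3
```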

### Data Transformation

Create new columns based on operations on existing columns (for example, computing revenue from sales and prices).

In [16]:
# Ratio of available to total nitrogen and phosphorus in the soil,
# computed column-wise (same result as a row-wise apply, but vectorized)
df['Soil_N_ratio'] = df['Soil_AN'] / df['Soil_TN']
df['Soil_P_ratio'] = df['Soil_AP'] / df['Soil_TP']
df.head()

Unnamed: 0,Plot_ID,Latitude,Longitude,DBH,Tree_Height,Crown_Width_North_South,Crown_Width_East_West,Slope,Elevation,Soil_TN,Soil_TP,Soil_AP,Soil_AN,Menhinick_Index,Gleason_Index,Disturbance_Level,Fire_Risk_Index,Health_Status,Soil_N_ratio,Soil_P_ratio
0,1,24.981605,-117.040695,29.86204,20.835684,6.147963,4.54272,29.171563,212.518419,0.723065,0.457221,0.189952,0.26885,2.135766,4.897271,0.073175,0.49967,Healthy,0.37182,0.415449
1,2,48.028572,-92.066934,28.462986,24.307079,8.248891,5.260921,7.757386,641.640332,0.69041,0.265053,0.169791,0.07326,0.700081,1.068692,0.089478,0.746747,Unhealthy,0.106111,0.640593
2,3,39.279758,-68.893791,91.094185,9.013101,7.841448,8.690927,39.257755,2510.612835,0.104797,0.363831,0.092196,0.297665,1.105825,4.790607,0.651974,0.562667,Unhealthy,2.840408,0.253404
3,4,33.946339,-78.744258,28.706889,19.496475,2.385099,4.060039,27.590231,2323.628233,0.923347,0.220844,0.305597,0.160819,2.434198,2.47471,0.486941,0.083303,Sub-Healthy,0.17417,1.383769
4,5,16.240746,-73.54072,30.835224,18.008888,2.343245,8.826847,7.074175,1116.863805,0.572787,0.316867,0.240929,0.030913,1.821715,1.040362,0.790415,0.18558,Unhealthy,0.053969,0.760347
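
Several of the ratios above exceed 1 (for example, 2.840408 in row 2), meaning the available amount is larger than the total. Whether that points to a data problem is an interpretation the source does not confirm, but such rows are easy to flag. The sketch below uses the Soil_TN and Soil_AN values of the first five rows shown earlier, and also checks that column-wise division matches a row-wise `apply`.

```python
import pandas as pd

# Soil_TN and Soil_AN for the first five rows shown earlier
toy = pd.DataFrame({
    'Soil_TN': [0.723065, 0.690410, 0.104797, 0.923347, 0.572787],
    'Soil_AN': [0.268850, 0.073260, 0.297665, 0.160819, 0.030913],
})

# Column-wise division; no row-wise apply needed
toy['Soil_N_ratio'] = toy['Soil_AN'] / toy['Soil_TN']

# Row-wise version, for comparison
row_wise = toy.apply(lambda row: row['Soil_AN'] / row['Soil_TN'], axis=1)

# Rows where the available amount exceeds the total
suspicious = toy[toy['Soil_N_ratio'] > 1]
print(suspicious.index.tolist())  # [2]
```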


Normalize or standardize columns if necessary.

In [29]:
def estandarizar_columna(columna):
    return (columna - columna.mean()) / columna.std()
# Columns to standardize
soil_columns = ['Soil_TN', 'Soil_TP', 'Soil_AP', 'Soil_AN']
df[soil_columns] = df[soil_columns].apply(estandarizar_columna)
df[soil_columns].describe()

Unnamed: 0,Soil_TN,Soil_TP,Soil_AP,Soil_AN
count,1000.0,1000.0,1000.0,1000.0
mean,-8.881784e-18,-7.105427e-18,-2.6645350000000002e-18,1.2434500000000001e-17
std,1.0,1.0,1.0,1.0
min,-1.722871,-1.76054,-1.762065,-1.70541
25%,-0.8401275,-0.824555,-0.8834286,-0.8502299
50%,-0.0276215,-0.005699695,0.002354238,-0.03646473
75%,0.8128286,0.848554,0.8762708,0.9062552
max,1.805293,1.781692,1.724476,1.668235


In [30]:
df['Fire_Risk_Index'] = estandarizar_columna(df['Fire_Risk_Index'])
df['Fire_Risk_Index'].describe()

count    1.000000e+03
mean     1.421085e-17
std      1.000000e+00
min     -1.687225e+00
25%     -8.738301e-01
50%      4.649388e-03
75%      8.795590e-01
max      1.749986e+00
Name: Fire_Risk_Index, dtype: float64
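
The transformation used here is the z-score, z = (x - mean) / std, where pandas' `std()` defaults to the sample standard deviation (`ddof=1`). The near-zero means and unit standard deviations in the outputs above are what this formula should produce. Here is a quick check on toy values (made up), with `standardize` as a hypothetical English-named copy of the same function:

```python
import pandas as pd

def standardize(column):
    # z-score: subtract the mean, divide by the sample standard deviation (ddof=1)
    return (column - column.mean()) / column.std()

# Toy values, made up for illustration
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = standardize(x)

# Mean should be 0 and standard deviation 1, up to floating-point error
print(abs(z.mean()) < 1e-12, abs(z.std() - 1) < 1e-12)
```

Note that the cells above overwrite the original measurements; writing the standardized values to new columns would keep both versions available.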

Classify the data into relevant categories.

In [34]:
# Define categories based on quartiles
def categorize(value, q1, q3):
    if value <= q1:
        return 'Low'
    elif value <= q3:
        return 'Moderate'
    else:
        return 'High'
# Add a category column for each soil nutrient
for col in soil_columns:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    df[f'{col}_category'] = df[col].apply(categorize, args=(q1, q3))

# Use Fire_Risk_Index's own quartiles rather than those left over from the loop
fire_q1 = df['Fire_Risk_Index'].quantile(0.25)
fire_q3 = df['Fire_Risk_Index'].quantile(0.75)
df['Fire_Risk_Category'] = df['Fire_Risk_Index'].apply(categorize, args=(fire_q1, fire_q3))

df.head()


Unnamed: 0,Plot_ID,Latitude,Longitude,DBH,Tree_Height,Crown_Width_North_South,Crown_Width_East_West,Slope,Elevation,Soil_TN,...,Disturbance_Level,Fire_Risk_Index,Health_Status,Soil_N_ratio,Soil_P_ratio,Soil_TN_category,Soil_TP_category,Soil_AP_category,Soil_AN_category,Fire_Risk_Category
0,1,24.981605,-117.040695,29.86204,20.835684,6.147963,4.54272,29.171563,212.518419,0.828287,...,0.073175,0.029843,Healthy,0.37182,0.415449,High,High,Moderate,Moderate,Moderate
1,2,48.028572,-92.066934,28.462986,24.307079,8.248891,5.260921,7.757386,641.640332,0.711417,...,0.089478,0.879424,Unhealthy,0.106111,0.640593,Moderate,Moderate,Moderate,Low,Moderate
2,3,39.279758,-68.893791,91.094185,9.013101,7.841448,8.690927,39.257755,2510.612835,-1.38446,...,0.651974,0.246459,Unhealthy,2.840408,0.253404,Low,Moderate,Low,Moderate,Moderate
3,4,33.946339,-78.744258,28.706889,19.496475,2.385099,4.060039,27.590231,2323.628233,1.545083,...,0.486941,-1.401852,Sub-Healthy,0.17417,1.383769,High,Moderate,Moderate,Moderate,Low
4,5,16.240746,-73.54072,30.835224,18.008888,2.343245,8.826847,7.074175,1116.863805,0.290452,...,0.790415,-1.050166,Unhealthy,0.053969,0.760347,Moderate,Moderate,Moderate,Low,Low
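
The same three-way split can also be expressed without an element-wise function. As a sketch, `pd.cut` with right-closed bins at the two quartiles should reproduce `categorize` exactly; the toy values below are made up, and `by_apply` and `by_cut` are hypothetical names.

```python
import numpy as np
import pandas as pd

def categorize(value, q1, q3):
    if value <= q1:
        return 'Low'
    elif value <= q3:
        return 'Moderate'
    else:
        return 'High'

# Toy values, made up for illustration
values = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
q1, q3 = values.quantile(0.25), values.quantile(0.75)

# Element-wise version, as in the notebook
by_apply = values.apply(categorize, args=(q1, q3))

# Vectorized version: right-closed bins (-inf, q1], (q1, q3], (q3, inf)
by_cut = pd.cut(values, bins=[-np.inf, q1, q3, np.inf],
                labels=['Low', 'Moderate', 'High'])

print((by_apply == by_cut.astype(str)).all())  # True
```

`pd.cut` also returns an ordered categorical, so the result sorts as Low, Moderate, High rather than alphabetically.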


### Data Analysis

Group the data with groupby to obtain specific insights (for example, sales by product, sales by region, etc.).

In [46]:
# Count health status within each combination of soil nutrient categories
health_by_soil = df.groupby(['Soil_TN_category', 'Soil_TP_category', 'Soil_AP_category', 'Soil_AN_category'])['Health_Status'].value_counts()
# Count fire risk category within each combination of soil nutrient categories
fire_risk_by_soil = df.groupby(['Soil_TN_category', 'Soil_TP_category', 'Soil_AP_category', 'Soil_AN_category'])['Fire_Risk_Category'].value_counts()

health_by_soil
fire_risk_by_soil

Soil_TN_category  Soil_TP_category  Soil_AP_category  Soil_AN_category  Fire_Risk_Category
High              High              High              High              Low                    1
                                                      Low               High                   4
                                                                        Moderate               2
                                                      Moderate          Moderate               3
                                                                        Low                    1
                                                                                              ..
Moderate          Moderate          Moderate          Low               High                  10
                                                                        Moderate              10
                                                      Moderate          Moderate              28
                                    

Apply aggregation functions such as sum, mean, count, min, max, std, and var.

In [45]:
aggregated_health_by_soil = df.groupby(
    ['Soil_TN_category', 'Soil_TP_category', 'Soil_AP_category', 'Soil_AN_category']
).agg(
    count_health_status=('Health_Status', 'count'),
    min_fire_risk=('Fire_Risk_Index', 'min'),
    max_fire_risk=('Fire_Risk_Index', 'max'),
    mean_fire_risk=('Fire_Risk_Index', 'mean'),
    std_fire_risk=('Fire_Risk_Index', 'std'),
    var_fire_risk=('Fire_Risk_Index', 'var')
).reset_index()
aggregated_health_by_soil

Unnamed: 0,Soil_TN_category,Soil_TP_category,Soil_AP_category,Soil_AN_category,count_health_status,min_fire_risk,max_fire_risk,mean_fire_risk,std_fire_risk,var_fire_risk
0,High,High,High,High,1,-1.020657,-1.020657,-1.020657,,
1,High,High,High,Low,6,-0.422450,1.466090,0.856736,0.664902,0.442095
2,High,High,High,Moderate,4,-1.451263,0.500822,-0.301441,0.867334,0.752268
3,High,High,Low,High,2,0.333952,1.680414,1.007183,0.952092,0.906480
4,High,High,Low,Low,10,-1.497828,1.172784,-0.530638,1.023541,1.047637
...,...,...,...,...,...,...,...,...,...,...
76,Moderate,Moderate,Low,Low,21,-1.565623,1.541172,0.312430,1.092434,1.193413
77,Moderate,Moderate,Low,Moderate,33,-1.618850,1.739112,0.032878,1.086136,1.179692
78,Moderate,Moderate,Moderate,High,30,-1.609474,1.414327,0.173081,0.956973,0.915797
79,Moderate,Moderate,Moderate,Low,31,-1.686385,1.729605,0.016061,1.170595,1.370293
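
Grouping on four three-level columns at once can yield up to 81 groups, some with a single row. That is why std and var are empty for the first group, since the sample standard deviation is undefined for one observation. A coarser grouping may be easier to interpret. The sketch below applies the same named-aggregation syntax to a single key, on a toy frame whose values are rounded from the first five rows of the original (pre-standardization) data.

```python
import pandas as pd

# Toy data: Health_Status and Fire_Risk_Index of the first five rows, rounded
toy = pd.DataFrame({
    'Health_Status': ['Healthy', 'Unhealthy', 'Unhealthy', 'Sub-Healthy', 'Unhealthy'],
    'Fire_Risk_Index': [0.50, 0.75, 0.56, 0.08, 0.19],
})

# Named aggregation: output column = (input column, function)
summary = toy.groupby('Health_Status').agg(
    n_rows=('Fire_Risk_Index', 'count'),
    mean_fire_risk=('Fire_Risk_Index', 'mean'),
    std_fire_risk=('Fire_Risk_Index', 'std'),
)

print(summary)
```

Groups with one row ('Healthy' and 'Sub-Healthy' here) still get NaN for `std_fire_risk`, so checking `n_rows` before reading the spread statistics is a reasonable habit.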
