# In-class activity: Aggregations & GroupBy

- Jorge Emiliano Pomar A01709338
- March 18, 2025

**Table of contents**<a id='toc0_'></a>     
  - [Problem statement](#toc1_1_)    
  - [Generate the DataFrame](#toc1_2_)    
  - [Explore the contents with simple aggregation](#toc1_3_)    
  - [Processing the data with grouping:](#toc1_4_)    


## <a id='toc1_1_'></a>[Problem statement](#toc0_)

The goal is a general analysis of evaluation data for a group of professors. The data is provided in the following file:  

`` teachingratings.csv ``


It contains evaluations of professors submitted by students at a university.

## <a id='toc1_2_'></a>[Generate the DataFrame](#toc0_)

In [1]:
import numpy as np
import pandas as pd

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

In [2]:
df = pd.read_csv("teachingratings.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 463 entries, 0 to 462
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   minority         463 non-null    object 
 1   age              463 non-null    int64  
 2   gender           463 non-null    object 
 3   credits          463 non-null    object 
 4   beauty           463 non-null    float64
 5   eval             463 non-null    float64
 6   division         463 non-null    object 
 7   native           463 non-null    object 
 8   tenure           463 non-null    object 
 9   students         463 non-null    int64  
 10  allstudents      463 non-null    int64  
 11  prof             463 non-null    int64  
 12  PrimaryLast      463 non-null    int64  
 13  vismin           463 non-null    int64  
 14  female           463 non-null    int64  
 15  single_credit    463 non-null    int64  
 16  upper_division   463 non-null    int64  
 17  English_speaker 

In [4]:
df.describe(include="object").T

| | count | unique | top | freq |
|---|---|---|---|---|
| minority | 463 | 2 | no | 399 |
| gender | 463 | 2 | male | 268 |
| credits | 463 | 2 | more | 436 |
| division | 463 | 2 | upper | 306 |
| native | 463 | 2 | yes | 435 |
| tenure | 463 | 2 | yes | 361 |


In [None]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 463.0 | 48.36501 | 9.802742 | 29.0 | 42.0 | 48.0 | 57.0 | 73.0 |
| beauty | 463.0 | 6.27114e-08 | 0.788648 | -1.450494 | -0.656269 | -0.068014 | 0.545602 | 1.970023 |
| eval | 463.0 | 3.998272 | 0.554866 | 2.1 | 3.6 | 4.0 | 4.4 | 5.0 |
| students | 463.0 | 36.62419 | 45.018481 | 5.0 | 15.0 | 23.0 | 40.0 | 380.0 |
| allstudents | 463.0 | 55.17711 | 75.0728 | 8.0 | 19.0 | 29.0 | 60.0 | 581.0 |
| prof | 463.0 | 45.43413 | 27.508902 | 1.0 | 20.0 | 44.0 | 70.5 | 94.0 |
| PrimaryLast | 463.0 | 0.2030238 | 0.402685 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| vismin | 463.0 | 0.1382289 | 0.345513 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| female | 463.0 | 0.4211663 | 0.49428 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| single_credit | 463.0 | 0.05831533 | 0.234592 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


## <a id='toc1_3_'></a>[Explore the contents with simple aggregation](#toc0_)

1. Using an aggregate operation, compute the mean of every numeric column in the DataFrame.

In [6]:
df.select_dtypes('number').mean()

age                4.836501e+01
beauty             6.271140e-08
eval               3.998272e+00
students           3.662419e+01
allstudents        5.517711e+01
prof               4.543413e+01
PrimaryLast        2.030238e-01
vismin             1.382289e-01
female             4.211663e-01
single_credit      5.831533e-02
upper_division     6.609071e-01
English_speaker    9.395248e-01
tenured_prof       7.796976e-01
dtype: float64
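
The same result can also be written with an explicit `aggregate` call, which is closer to the wording of the exercise. A minimal sketch on a small hypothetical DataFrame (`toy` and its values are invented for illustration and do not come from `teachingratings.csv`):

```python
import pandas as pd

# Hypothetical toy data: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "gender": ["female", "male", "male"],
    "age": [30, 40, 50],
    "eval": [4.0, 3.5, 4.5],
})

# Keep only the numeric columns, as in the cell above.
numeric = toy.select_dtypes("number")

# aggregate("mean") reduces each numeric column to its mean,
# which is equivalent to calling .mean() directly.
means = numeric.aggregate("mean")
print(means)
```

Passing a list such as `["mean", "std"]` to `aggregate` would return one row per statistic instead of a single Series.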

2. Perform a row-wise aggregation that displays the sum of each row.

In [7]:
df.select_dtypes('number').sum(axis=1)  # axis=1 adds across the columns, giving one total per row

0      113.589916
1      256.989916
2      246.889916
3      247.689916
4      104.762268
          ...    
458    117.433396
459    173.611563
460    241.099420
461    160.943014
462    242.391822
Length: 463, dtype: float64
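
The direction of the `axis` argument is easy to mix up, so here is a small sketch on hypothetical values (`toy` is an illustrative name, not part of the notebook) contrasting the two directions:

```python
import pandas as pd

# Hypothetical two-row frame with made-up values.
toy = pd.DataFrame({"students": [10, 20], "allstudents": [15, 30]})

# axis=0 collapses the rows: one total per column.
col_totals = toy.sum(axis=0)

# axis=1 collapses the columns: one total per row.
row_totals = toy.sum(axis=1)

print(col_totals)
print(row_totals)
```

The result of `axis=1` keeps the row index of the original frame, which is why the output above has 463 entries.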

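3. Using an aggregate operation, compute the mean absolute deviation of the `beauty` variable.

The two cells below report the mean and the standard deviation of `beauty`. Neither is the mean absolute deviation, which is the average of `|x - mean(x)|` and can be obtained with `(df['beauty'] - df['beauty'].mean()).abs().mean()`.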

In [8]:
df['beauty'].mean()

6.271139975345787e-08

In [9]:
df['beauty'].std()

0.7886476677562292
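
One possible way to express the mean absolute deviation as an aggregation, sketched on hypothetical values (`toy` and `mean_abs_dev` are illustrative names, not part of the original notebook). `Series.mad()` was removed in pandas 2.0, so the sketch assumes a small custom function is acceptable:

```python
import pandas as pd

def mean_abs_dev(s):
    """Mean absolute deviation around the mean: mean(|x - mean(x)|)."""
    return (s - s.mean()).abs().mean()

# Hypothetical values standing in for the real 'beauty' column.
toy = pd.DataFrame({"beauty": [-1.0, 0.0, 1.0, 2.0]})

# DataFrame.aggregate passes each column to the function as a Series.
mad = toy.aggregate(mean_abs_dev)["beauty"]
print(mad)
```

On the real data, the same function could be applied as `df[['beauty']].aggregate(mean_abs_dev)`.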

4. For the `eval` variable, compute the minimum and maximum values using aggregation operations.

In [None]:
df['eval'].aggregate([
    'min', 'max'
])

min    2.1
max    5.0
Name: eval, dtype: float64

## <a id='toc1_4_'></a>[Processing the data with grouping:](#toc0_)

1. Group by gender to compute the mean beauty score, along with its standard deviation and variance.

In [11]:
df_gender_grp = df.groupby('gender')['beauty'].aggregate(['mean', 'std', 'var'])

In [12]:
df_gender_grp

| gender | mean | std | var |
|---|---|---|---|
| female | 0.116109 | 0.81781 | 0.668813 |
| male | -0.084482 | 0.75713 | 0.573246 |


2. Compute the total number of men and women who are tenured professors.

In [13]:
df_prof_grp = df.groupby(['gender', 'tenure'])['tenure'].size()

In [14]:
df_prof_grp

gender  tenure
female  no         50
        yes       145
male    no         52
        yes       216
Name: tenure, dtype: int64

3. Add a column to the previous result to display the percentage of tenured professors by gender.

In [27]:
df_prof_grp_percent = pd.DataFrame(df.groupby(['tenure', 'gender']).size(), columns=['Conteo'])
df_prof_grp_percent = df_prof_grp_percent.loc['yes'].copy()  # copy so the new column is added to an independent frame
df_prof_grp_percent['Porcentaje'] = df_prof_grp_percent['Conteo'] / df_prof_grp_percent['Conteo'].sum() * 100
df_prof_grp_percent

| gender | Conteo | Porcentaje |
|---|---|---|
| female | 145 | 40.166205 |
| male | 216 | 59.833795 |
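
The percentages above are each gender's share of all tenured records. The exercise can also be read as asking what fraction of each gender is tenured. A minimal sketch of that second reading, with the counts copied by hand from the `gender`/`tenure` output above (the names `counts` and `tenured_share` are illustrative):

```python
import pandas as pd

# Counts copied from the grouped output above (gender x tenure).
counts = pd.Series(
    [50, 145, 52, 216],
    index=pd.MultiIndex.from_tuples(
        [("female", "no"), ("female", "yes"), ("male", "no"), ("male", "yes")],
        names=["gender", "tenure"],
    ),
    name="Conteo",
)

# Divide each count by the total for its gender, expressed as a percentage.
pct_within_gender = counts / counts.groupby(level="gender").transform("sum") * 100

# Keep only the tenured rows; the 'tenure' level is dropped by xs().
tenured_share = pct_within_gender.xs("yes", level="tenure").round(2)
print(tenured_share)
```

Under this reading, about 74.36% of the female records and 80.60% of the male records are tenured. The dataset has 463 rows for 94 distinct `prof` values, so both versions count records, not unique professors.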


4. Group by gender and compute the basic descriptive statistics of age (count, mean, and the 25th, 50th and 75th percentiles) using an aggregation function.

In [17]:
df.groupby('gender')['age'].describe()

| gender | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| female | 195.0 | 45.092308 | 8.532031 | 29.0 | 38.0 | 46.0 | 52.0 | 62.0 |
| male | 268.0 | 50.746269 | 9.993396 | 32.0 | 43.0 | 51.0 | 59.25 | 73.0 |
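
`describe()` also returns `std`, `min` and `max`, which the exercise does not ask for. A sketch of how only the requested statistics could be produced with named aggregation, on hypothetical ages (`toy`, `q25` and `q75` are illustrative names):

```python
import pandas as pd

# Hypothetical ages, four per gender, chosen so the quantiles are easy to check.
toy = pd.DataFrame({
    "gender": ["female"] * 4 + ["male"] * 4,
    "age": [30, 34, 38, 42, 40, 50, 60, 70],
})

def q25(s):
    """25th percentile (pandas default linear interpolation)."""
    return s.quantile(0.25)

def q75(s):
    """75th percentile (pandas default linear interpolation)."""
    return s.quantile(0.75)

# Named aggregation: each keyword becomes a column of the result.
summary = toy.groupby("gender")["age"].agg(
    count="count", mean="mean", p25=q25, p50="median", p75=q75
)
print(summary)
```

The keyword names (`p25`, `p50`, `p75`) are arbitrary labels and can be changed freely.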


5. Define a function that filters on a sum of students > 100, group by professor ID (`prof`), and store the result in a new DataFrame. Finally, apply a filter operation to show only the records of professors who have more than 100 students in total.

In [23]:
def mas_de_cien(x):
    return x['students'].sum() > 100

In [26]:
df.groupby('prof').filter(mas_de_cien)

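| | minority | age | gender | credits | beauty | eval | division | native | tenure | students | allstudents | prof | PrimaryLast | vismin | female | single_credit | upper_division | English_speaker | tenured_prof |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | yes | 36 | female | more | 0.289916 | 4.3 | upper | yes | yes | 24 | 43 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | yes | 36 | female | more | 0.289916 | 3.7 | upper | yes | yes | 86 | 125 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | yes | 36 | female | more | 0.289916 | 3.6 | upper | yes | yes | 76 | 125 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | yes | 36 | female | more | 0.289916 | 4.4 | upper | yes | yes | 77 | 123 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 7 | no | 51 | male | more | -0.571984 | 3.7 | upper | yes | yes | 55 | 55 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 451 | no | 32 | male | more | 1.231394 | 4.3 | upper | yes | yes | 52 | 86 | 93 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| 452 | yes | 42 | female | more | 0.420400 | 3.3 | upper | no | yes | 48 | 84 | 94 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 453 | yes | 42 | female | more | 0.420400 | 3.3 | upper | no | yes | 52 | 67 | 94 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 454 | yes | 42 | female | more | 0.420400 | 3.2 | upper | no | yes | 54 | 66 | 94 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |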


- 58 of the 94 professors have more than 100 students in total.

In [28]:
(df.groupby('prof')['students'].sum() > 100).sum()

58

In [30]:
df['prof'].nunique()

94
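
`groupby(...).filter(...)` keeps or drops whole groups, so every row of a qualifying professor is returned, not just one row per professor. A small sketch on hypothetical data (`toy`, `more_than_100` and `busy` are illustrative names) that also stores the filtered rows in a new DataFrame, as the exercise asks:

```python
import pandas as pd

# Hypothetical records: professor 1 totals 110 students, 2 totals 80, 3 totals 120.
toy = pd.DataFrame({
    "prof": [1, 1, 2, 3, 3],
    "students": [60, 50, 80, 30, 90],
})

def more_than_100(group):
    """Return True when the group's total number of students exceeds 100."""
    return group["students"].sum() > 100

# filter() keeps every row that belongs to a group for which the function is True.
busy = toy.groupby("prof").filter(more_than_100)

# Cross-check: count the professors whose total exceeds 100.
totals = toy.groupby("prof")["students"].sum()
n_busy = int((totals > 100).sum())

print(busy)
print(n_busy)
```

In this sketch, professors 1 and 3 pass the filter, so `busy` has four rows while `n_busy` is 2, mirroring the difference between the filtered table and the count of 58 above.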