# Practice 39 - pandas operations on a nested JSON

## Analyzing the nested file `companies.json` with pandas

### Redo the code from the notebook `JSON_Limpiar.ipynb`, starting at `Sección 2- Análisis JSON anidado con Pandas`, from code cell 25 onward.
- Add some basic descriptive statistics (as many as you like) computed on the final DataFrame, and store those statistics in a separate DataFrame. Optionally, add a few plots with matplotlib.
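
One possible shape for the statistics step is sketched below. It assumes a flattened DataFrame with `age`, `salary` and `company` columns; the five rows are copied from the notebook output further down, and the names `df_sample`, `df_stats` and `df_stats_by_company` are illustrative only.

```python
import pandas as pd

# First five employees, copied from the notebook output (illustrative subset).
df_sample = pd.DataFrame({
    "name": ["John Doe", "Jane Smith", "Mike Johnson", "Alice Brown", "Bob White"],
    "age": [30, 25, 35, 28, 32],
    "salary": [3450, 2800, 4200, 3100, 2300],
    "company": ["Tech Solutions", "Tech Solutions", "Tech Solutions",
                "Innovative Apps", "Innovative Apps"],
})

# Overall descriptive statistics for the numeric columns, kept in their own DataFrame.
df_stats = df_sample[["age", "salary"]].describe()

# Per-company salary statistics, also kept as a DataFrame.
df_stats_by_company = (
    df_sample.groupby("company")["salary"]
    .agg(["count", "mean", "median", "min", "max"])
)

print(df_stats)
print(df_stats_by_company)
```

If matplotlib is installed, a call such as `df_stats_by_company["mean"].plot(kind="bar")` would be one way to cover the optional plot.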

In [8]:
import json

with open('companies.json', 'r') as file:
    data_json = json.load(file)

# Serialize the dictionary to an indented JSON string for display
data_str = json.dumps(data_json, indent=4)
print(data_str)  

{
    "companies": [
        {
            "company": "Tech Solutions",
            "employees": [
                {
                    "name": "John Doe",
                    "age": 30,
                    "address": {
                        "street": "123 Main St",
                        "city": "New York",
                        "zipcode": "10001"
                    },
                    "skills": [
                        "Python",
                        "Data Analysis"
                    ],
                    "salary": 3450
                },
                {
                    "name": "Jane Smith",
                    "age": 25,
                    "address": {
                        "street": "456 Market St",
                        "city": "San Francisco",
                        "zipcode": "94105"
                    },
                    "skills": [
                        "JavaScript",
                        "React"
                    ],
                    "s

Load the nested JSON into a pandas DataFrame.

We use `pandas.json_normalize` to flatten the JSON structure and load it into a DataFrame.

In [55]:
import pandas as pd
import json

# Load the JSON from a file (if you have not done so already)
# with open('path_to_your_file.json', 'r') as file:
#     data_json = json.load(file)

# Check the structure of the JSON
print(data_json.keys())
print(data_json['companies'][0])

# Step 1: Normalize the JSON to get one row per company
df_companies = pd.json_normalize(data_json, record_path='companies')

# Step 2: Explode the list of employees so that each employee gets its own row
# First, make sure every value in the 'employees' column is a list
df_companies['employees'] = df_companies['employees'].apply(lambda x: x if isinstance(x, list) else [x])

# Explode the 'employees' column
df_employees_expanded = df_companies.explode('employees')

# Step 3: Normalize the employee records (nested 'address' keys become 'address.street', etc.)
# .tolist() passes plain dicts, so the result gets a fresh 0..n-1 index
df_employees = pd.json_normalize(df_employees_expanded['employees'].tolist())

# Add the 'company' column to the employees DataFrame
df_employees['company'] = df_employees_expanded['company'].values

print(df_employees.head())



dict_keys(['companies'])
{'company': 'Tech Solutions', 'employees': [{'name': 'John Doe', 'age': 30, 'address': {'street': '123 Main St', 'city': 'New York', 'zipcode': '10001'}, 'skills': ['Python', 'Data Analysis'], 'salary': 3450}, {'name': 'Jane Smith', 'age': 25, 'address': {'street': '456 Market St', 'city': 'San Francisco', 'zipcode': '94105'}, 'skills': ['JavaScript', 'React'], 'salary': 2800}, {'name': 'Mike Johnson', 'age': 35, 'address': {'street': '789 Broadway', 'city': 'Los Angeles', 'zipcode': '90015'}, 'skills': ['Java', 'Spring'], 'salary': 4200}]}
           name  age                   skills  salary  address.street  \
0      John Doe   30  [Python, Data Analysis]    3450     123 Main St   
1    Jane Smith   25      [JavaScript, React]    2800   456 Market St   
2  Mike Johnson   35           [Java, Spring]    4200    789 Broadway   
3   Alice Brown   28            [Ruby, Rails]    3100   101 First Ave   
4     Bob White   32           [PHP, Laravel]    2300  202 Seco
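
The three steps above can also be collapsed into a single `json_normalize` call by using its `record_path` and `meta` parameters. The sketch below is an alternative, not the notebook's method; it runs on a reduced copy of the file's structure, and `data_sample` and `df_flat` are illustrative names.

```python
import pandas as pd

# Reduced copy of the structure in companies.json (two employees of one company).
data_sample = {
    "companies": [
        {
            "company": "Tech Solutions",
            "employees": [
                {
                    "name": "John Doe",
                    "age": 30,
                    "address": {"street": "123 Main St", "city": "New York", "zipcode": "10001"},
                    "skills": ["Python", "Data Analysis"],
                    "salary": 3450,
                },
                {
                    "name": "Jane Smith",
                    "age": 25,
                    "address": {"street": "456 Market St", "city": "San Francisco", "zipcode": "94105"},
                    "skills": ["JavaScript", "React"],
                    "salary": 2800,
                },
            ],
        }
    ]
}

# record_path walks into each company's 'employees' list (one row per employee);
# meta copies the parent-level 'company' value onto every employee row.
df_flat = pd.json_normalize(
    data_sample["companies"],
    record_path="employees",
    meta=["company"],
)

print(df_flat.columns.tolist())
print(df_flat[["name", "company", "address.city"]])
```

The nested `address` dictionary is flattened into `address.street`, `address.city` and `address.zipcode`, while the `skills` list is left as a list in a single column.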

In [11]:
df

| | name | age | skills | salary | address.street | address.city | address.zipcode | company |
|---|---|---|---|---|---|---|---|---|
| 0 | John Doe | 30 | [Python, Data Analysis] | 3450 | 123 Main St | New York | 10001 | |
| 1 | Jane Smith | 25 | [JavaScript, React] | 2800 | 456 Market St | San Francisco | 94105 | |
| 2 | Mike Johnson | 35 | [Java, Spring] | 4200 | 789 Broadway | Los Angeles | 90015 | |
| 3 | Alice Brown | 28 | [Ruby, Rails] | 3100 | 101 First Ave | Seattle | 98101 | |
| 4 | Bob White | 32 | [PHP, Laravel] | 2300 | 202 Second Ave | Austin | 73301 | |
| 5 | Charlie Green | 29 | [AWS, Azure] | 3700 | 303 Third St | Chicago | 60601 | |
| 6 | Dana Blue | 27 | [GCP, Docker] | 2700 | 404 Fourth St | Miami | 33101 | |
| 7 | Eva Black | 33 | [TensorFlow, Keras] | 4500 | 505 Fifth Ave | Boston | 2101 | |
| 8 | Frank White | 36 | [PyTorch, SciKit-Learn] | 5100 | 606 Sixth Ave | Denver | 80201 | |
| 9 | George Brown | 31 | [Network Security, Penetration Testing] | 3900 | 707 Seventh St | Houston | 77001 | |


### Data cleaning

Handling missing values

Let's deal with missing values in our DataFrame. This example has none, but we show how they can be handled.

In [47]:
print(df_employees.isna().sum())  # Series with the number of missing (NaN) values in each column

# Fill missing values with a placeholder or drop rows/columns with missing values
# Example: fill missing values with 'Unknown'
df.fillna('Unknown', inplace=True)  # Replaces every missing value with 'Unknown'; inplace=True modifies df itself instead of returning a copy

name               0
age                0
skills             0
salary             0
address.street     0
address.city       0
address.zipcode    0
company            0
dtype: int64
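
Because the real data has no gaps, the effect of the calls above is easier to see on a small frame with gaps injected on purpose. The sketch below is hypothetical: the missing `city` and `salary` values are invented for the demonstration, and `df_gaps`, `df_filled` and `df_dropped` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: missing values are injected here for demonstration only.
df_gaps = pd.DataFrame({
    "name": ["John Doe", "Jane Smith", "Mike Johnson"],
    "city": ["New York", None, "Los Angeles"],
    "salary": [3450, np.nan, 4200],
})

# Count missing values per column.
missing_counts = df_gaps.isna().sum()

# Fill each column with a value that suits its type instead of one global placeholder.
df_filled = df_gaps.fillna({"city": "Unknown", "salary": df_gaps["salary"].median()})

# Alternatively, drop every row that has at least one missing value.
df_dropped = df_gaps.dropna()

print(missing_counts)
print(df_filled)
print(df_dropped)
```

Passing a dictionary to `fillna` keeps numeric columns numeric, whereas filling everything with `'Unknown'` would turn a numeric column that has gaps into a mixed-type column.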


### Renaming columns

In [48]:
df.rename(columns={
    'address.street': 'street',
    'address.city': 'city',
    'address.zipcode': 'zipcode'
}, inplace=True)
print(df.head())

           name  age  salary          street           city zipcode  \
0      John Doe   30    3450     123 Main St       New York   10001   
1    Jane Smith   25    2800   456 Market St  San Francisco   94105   
2  Mike Johnson   35    4200    789 Broadway    Los Angeles   90015   
3   Alice Brown   28    3100   101 First Ave        Seattle   98101   
4     Bob White   32    2300  202 Second Ave         Austin   73301   

                                           companies     skill_1  \
0  [{'company': 'Tech Solutions', 'employees': [{...      Python   
1  [{'company': 'Tech Solutions', 'employees': [{...  JavaScript   
2  [{'company': 'Tech Solutions', 'employees': [{...        Java   
3  [{'company': 'Tech Solutions', 'employees': [{...        Ruby   
4  [{'company': 'Tech Solutions', 'employees': [{...         PHP   

         skill_2  
0  Data Analysis  
1          React  
2         Spring  
3          Rails  
4        Laravel  
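
When many columns share the same prefix, the explicit mapping passed to `rename` can be replaced by a string operation on the column index. This is a minimal sketch on a one-row frame; `df_cols` is an illustrative name.

```python
import pandas as pd

df_cols = pd.DataFrame(
    [["John Doe", "123 Main St", "New York", "10001"]],
    columns=["name", "address.street", "address.city", "address.zipcode"],
)

# Remove the literal text 'address.'; regex=False keeps the dot from acting as a wildcard.
df_cols.columns = df_cols.columns.str.replace("address.", "", regex=False)

print(df_cols.columns.tolist())
# ['name', 'street', 'city', 'zipcode']
```

Note that `str.replace` removes the text wherever it occurs in a column name, not only at the start, so the explicit mapping remains the safer choice when names are less regular.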


In [39]:
# Expand the 'address' column into separate columns
# (skipped when json_normalize has already flattened it into 'address.*' columns)
if 'address' in df_employees.columns:
    df_address = pd.json_normalize(df_employees['address'])
else:
    df_address = pd.DataFrame()

# Expand the 'skills' list column into separate columns, one per list position (0, 1, ...)
if 'skills' in df_employees.columns:
    df_skills = df_employees['skills'].apply(pd.Series)
else:
    df_skills = pd.DataFrame()

# Combine the DataFrames side by side
df_final = pd.concat([df_employees.drop(columns=['address', 'skills'], errors='ignore'), df_address, df_skills], axis=1)

# Print the final DataFrame
print(df_final.head())




           name  age  salary  address.street   address.city address.zipcode  \
0      John Doe   30    3450     123 Main St       New York           10001   
1    Jane Smith   25    2800   456 Market St  San Francisco           94105   
2  Mike Johnson   35    4200    789 Broadway    Los Angeles           90015   
3   Alice Brown   28    3100   101 First Ave        Seattle           98101   
4     Bob White   32    2300  202 Second Ave         Austin           73301   

           company           0              1  
0   Tech Solutions      Python  Data Analysis  
1   Tech Solutions  JavaScript          React  
2   Tech Solutions        Java         Spring  
3  Innovative Apps        Ruby          Rails  
4  Innovative Apps         PHP        Laravel  
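
The expanded skill columns above end up with the integer labels `0` and `1`. One way to give them descriptive names such as `skill_1` and `skill_2` is sketched below on two rows from the data; `df_sk`, `skills_wide` and `df_wide` are illustrative names.

```python
import pandas as pd

df_sk = pd.DataFrame({
    "name": ["John Doe", "Jane Smith"],
    "skills": [["Python", "Data Analysis"], ["JavaScript", "React"]],
})

# Build one column per list position, then label them skill_1, skill_2, ...
skills_wide = pd.DataFrame(df_sk["skills"].tolist(), index=df_sk.index)
skills_wide.columns = [f"skill_{i + 1}" for i in range(skills_wide.shape[1])]

# Replace the original list column with the named columns.
df_wide = pd.concat([df_sk.drop(columns="skills"), skills_wide], axis=1)

print(df_wide)
```

Building the frame from `tolist()` avoids calling `pd.Series` once per row, which tends to be slow on larger data.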


### Expanding list columns

If you have columns that contain lists (for example, `skills`), you may want to expand them into separate rows or columns.

In [46]:
print(df_final.head())

           name  age  salary  address.street   address.city address.zipcode  \
0      John Doe   30    3450     123 Main St       New York           10001   
1    Jane Smith   25    2800   456 Market St  San Francisco           94105   
2  Mike Johnson   35    4200    789 Broadway    Los Angeles           90015   
3   Alice Brown   28    3100   101 First Ave        Seattle           98101   
4     Bob White   32    2300  202 Second Ave         Austin           73301   

           company           0              1  
0   Tech Solutions      Python  Data Analysis  
1   Tech Solutions  JavaScript          React  
2   Tech Solutions        Java         Spring  
3  Innovative Apps        Ruby          Rails  
4  Innovative Apps         PHP        Laravel  
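
The output above shows the column layout. The row layout mentioned in this section can be sketched with `DataFrame.explode`, again on two rows from the data; `df_sk`, `df_long` and `skill_counts` are illustrative names.

```python
import pandas as pd

df_sk = pd.DataFrame({
    "name": ["John Doe", "Jane Smith"],
    "company": ["Tech Solutions", "Tech Solutions"],
    "skills": [["Python", "Data Analysis"], ["JavaScript", "React"]],
})

# One row per (employee, skill) pair; the other columns are repeated.
df_long = df_sk.explode("skills", ignore_index=True).rename(columns={"skills": "skill"})

# The long layout makes per-skill counts straightforward.
skill_counts = df_long["skill"].value_counts()

print(df_long)
print(skill_counts)
```

The long layout suits counting and grouping by skill, while the wide layout keeps one row per employee.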
