# Exercise 39 - pandas operations on a nested JSON

## Analyzing the nested file `companies.json` with pandas

### Rework the code from the notebook `JSON_Limpiar.ipynb`, starting at `Sección 2- Análisis JSON anidado con Pandas`, from code cell 25 onward.
- Add some basic descriptive statistics (as many as you like) computed on the final DataFrame, and store those statistics in a separate DataFrame. Optionally, add a few plots with matplotlib.

## Importing libraries

In [53]:
import json
import pprint
from collections import Counter

import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot
import numpy as np
import pandas as pd

## Reading the nested JSON

In [54]:
with open('companies.json') as file:
    data = json.load(file)

In [55]:
# Flatten the nested structure: one row per employee, with the company name as metadata
df = pd.json_normalize(data, record_path=['companies', 'employees'],
                       meta=[['companies', 'company']],
                       record_prefix='employee_')

# Strip the 'employee_' prefix from the column names for readability
df.columns = [col.replace('employee_', '') for col in df.columns]

# Show the resulting DataFrame
display(df)

| | name | age | skills | salary | address.street | address.city | address.zipcode | companies.company |
|---|---|---|---|---|---|---|---|---|
| 0 | John Doe | 30 | [Python, Data Analysis] | 3450 | 123 Main St | New York | 10001 | Tech Solutions |
| 1 | Jane Smith | 25 | [JavaScript, React] | 2800 | 456 Market St | San Francisco | 94105 | Tech Solutions |
| 2 | Mike Johnson | 35 | [Java, Spring] | 4200 | 789 Broadway | Los Angeles | 90015 | Tech Solutions |
| 3 | Alice Brown | 28 | [Ruby, Rails] | 3100 | 101 First Ave | Seattle | 98101 | Innovative Apps |
| 4 | Bob White | 32 | [PHP, Laravel] | 2300 | 202 Second Ave | Austin | 73301 | Innovative Apps |
| 5 | Charlie Green | 29 | [AWS, Azure] | 3700 | 303 Third St | Chicago | 60601 | Cloud Services |
| 6 | Dana Blue | 27 | [GCP, Docker] | 2700 | 404 Fourth St | Miami | 33101 | Cloud Services |
| 7 | Eva Black | 33 | [TensorFlow, Keras] | 4500 | 505 Fifth Ave | Boston | 2101 | AI Innovations |
| 8 | Frank White | 36 | [PyTorch, SciKit-Learn] | 5100 | 606 Sixth Ave | Denver | 80201 | AI Innovations |
| 9 | George Brown | 31 | [Network Security, Penetration Testing] | 3900 | 707 Seventh St | Houston | 77001 | Cyber Security Inc |
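
The flattening step can be tried in isolation on a small in-memory sample. The sketch below is only illustrative: the `sample` dictionary is a hypothetical stand-in that assumes `companies.json` has a top-level `companies` list whose items carry a `company` name and an `employees` list. It omits `record_prefix`, so the record columns come out without a prefix to strip.

```python
import pandas as pd

# Hypothetical sample mirroring the assumed layout of companies.json
sample = {
    "companies": [
        {
            "company": "Tech Solutions",
            "employees": [
                {"name": "John Doe", "age": 30, "skills": ["Python", "Data Analysis"],
                 "salary": 3450,
                 "address": {"street": "123 Main St", "city": "New York", "zipcode": "10001"}},
                {"name": "Jane Smith", "age": 25, "skills": ["JavaScript", "React"],
                 "salary": 2800,
                 "address": {"street": "456 Market St", "city": "San Francisco", "zipcode": "94105"}},
            ],
        },
        {
            "company": "Innovative Apps",
            "employees": [
                {"name": "Alice Brown", "age": 28, "skills": ["Ruby", "Rails"],
                 "salary": 3100,
                 "address": {"street": "101 First Ave", "city": "Seattle", "zipcode": "98101"}},
            ],
        },
    ]
}

# record_path walks down to the list of employee records;
# meta pulls the company name from the parent level into every row.
flat = pd.json_normalize(
    sample,
    record_path=["companies", "employees"],
    meta=[["companies", "company"]],
)

# Nested dicts such as 'address' become dotted columns (address.city, ...)
print(flat.columns.tolist())
print(flat[["name", "address.city", "companies.company"]])
```

Each employee becomes one row, and the nested `address` dictionary is spread over dotted column names, which is why the real DataFrame has columns such as `address.city`.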


## Data cleaning

In [56]:
df.info()

print('------------------------------------')

print(df.shape)
# There are no null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   name               21 non-null     object
 1   age                21 non-null     int64 
 2   skills             21 non-null     object
 3   salary             21 non-null     int64 
 4   address.street     21 non-null     object
 5   address.city       21 non-null     object
 6   address.zipcode    21 non-null     object
 7   companies.company  21 non-null     object
dtypes: int64(2), object(6)
memory usage: 1.4+ KB
------------------------------------
(21, 8)


In [57]:
df.describe()

| | age | salary |
|---|---|---|
| count | 21.0 | 21.0 |
| mean | 29.857143 | 3592.857143 |
| std | 3.037856 | 721.15978 |
| min | 25.0 | 2300.0 |
| 25% | 28.0 | 3100.0 |
| 50% | 30.0 | 3600.0 |
| 75% | 32.0 | 4000.0 |
| max | 36.0 | 5100.0 |


In [58]:
print(df.isna().sum())  # Null values per column: there are none

name                 0
age                  0
skills               0
salary               0
address.street       0
address.city         0
address.zipcode      0
companies.company    0
dtype: int64


In [59]:
df.columns

Index(['name', 'age', 'skills', 'salary', 'address.street', 'address.city',
       'address.zipcode', 'companies.company'],
      dtype='object')

### Renaming columns

In [60]:
df.rename(columns={
    'name': 'nombre',
    'age': 'edad',
    'address.street': 'direccion',
    'address.city': 'ciudad',
    'address.zipcode': 'codigo_postal',
    'companies.company': 'compania',
    # 'salary' keeps its name: a key such as 'salary ' (with a trailing space)
    # matches no column and is skipped silently.
}, inplace=True)

display(df.head())    

| | nombre | edad | skills | salary | direccion | ciudad | codigo_postal | compania |
|---|---|---|---|---|---|---|---|---|
| 0 | John Doe | 30 | [Python, Data Analysis] | 3450 | 123 Main St | New York | 10001 | Tech Solutions |
| 1 | Jane Smith | 25 | [JavaScript, React] | 2800 | 456 Market St | San Francisco | 94105 | Tech Solutions |
| 2 | Mike Johnson | 35 | [Java, Spring] | 4200 | 789 Broadway | Los Angeles | 90015 | Tech Solutions |
| 3 | Alice Brown | 28 | [Ruby, Rails] | 3100 | 101 First Ave | Seattle | 98101 | Innovative Apps |
| 4 | Bob White | 32 | [PHP, Laravel] | 2300 | 202 Second Ave | Austin | 73301 | Innovative Apps |
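
By default, `DataFrame.rename` skips any key that matches no column, so a misspelled key (for example one with a trailing space) goes unnoticed, and `salary` keeps its name in the output above. A minimal sketch of how `errors="raise"` surfaces such a typo, using a hypothetical two-column frame `df_demo`:

```python
import pandas as pd

# Hypothetical miniature frame, used only to illustrate the behaviour
df_demo = pd.DataFrame({"name": ["John Doe"], "salary": [3450]})

# Default behaviour (errors="ignore"): the unmatched key 'salary ' is skipped silently
renamed = df_demo.rename(columns={"name": "nombre", "salary ": "salario"})
print(renamed.columns.tolist())  # 'salary' has not been renamed

# With errors="raise", the same mapping is rejected with a KeyError
try:
    df_demo.rename(columns={"name": "nombre", "salary ": "salario"}, errors="raise")
    typo_detected = False
except KeyError as exc:
    typo_detected = True
    print(f"Unmatched rename key: {exc}")
```

Passing `errors="raise"` is a cheap safeguard whenever the mapping is typed by hand.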


### Changing data types

In [61]:
df.dtypes

nombre           object
edad              int64
skills           object
salary            int64
direccion        object
ciudad           object
codigo_postal    object
compania         object
dtype: object

In [62]:
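# 'edad' is already an integer column; this cast only switches it to the platform default int
df['edad'] = df['edad'].astype(int)
# Caution: casting ZIP codes to int drops any leading zero (e.g. '02101' becomes 2101)
df['codigo_postal'] = df['codigo_postal'].astype(int)
print(df.dtypes)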

nombre           object
edad              int32
skills           object
salary            int64
direccion        object
ciudad           object
codigo_postal     int32
compania         object
dtype: object
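
ZIP codes are identifiers rather than quantities, so an integer cast can lose information when a code starts with zero. The sketch below illustrates the effect on a hypothetical series `zips` (the values are illustrative, not read from `companies.json`) and shows one way to restore five-digit text with `str.zfill`.

```python
import pandas as pd

# Hypothetical ZIP codes stored as text; one of them starts with a zero
zips = pd.Series(["10001", "02101", "94105"], name="codigo_postal")

# Casting to int silently drops the leading zero
as_int = zips.astype(int)
print(as_int.tolist())

# Going back to text and padding to five characters restores the original form
restored = as_int.astype(str).str.zfill(5)
print(restored.tolist())
```

Keeping the column as text avoids the round trip altogether. The integer cast is only worthwhile if the codes are going to be used numerically.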


### Expanding the 'skills' column, since it holds lists

In [63]:
skills_expanded = df['skills'].apply(pd.Series)
skills_expanded.columns = [f'skill_{i+1}' for i in skills_expanded.columns]

# Concatenate the expanded skills back to the original DataFrame
df = pd.concat([df.drop(columns='skills'), skills_expanded], axis=1)
display(df.head())


| | nombre | edad | salary | direccion | ciudad | codigo_postal | compania | skill_1 | skill_2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | John Doe | 30 | 3450 | 123 Main St | New York | 10001 | Tech Solutions | Python | Data Analysis |
| 1 | Jane Smith | 25 | 2800 | 456 Market St | San Francisco | 94105 | Tech Solutions | JavaScript | React |
| 2 | Mike Johnson | 35 | 4200 | 789 Broadway | Los Angeles | 90015 | Tech Solutions | Java | Spring |
| 3 | Alice Brown | 28 | 3100 | 101 First Ave | Seattle | 98101 | Innovative Apps | Ruby | Rails |
| 4 | Bob White | 32 | 2300 | 202 Second Ave | Austin | 73301 | Innovative Apps | PHP | Laravel |
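
A list column can be reshaped in two ways: wide, with one column per position as above, or long, with one row per skill. The sketch below shows both on a hypothetical frame `demo` with made-up names. The wide form is built with `tolist()`, which usually runs faster than `apply(pd.Series)`. The long form uses `explode` and makes it easy to count skills.

```python
import pandas as pd

# Hypothetical data with a list-valued column
demo = pd.DataFrame({
    "nombre": ["Ana", "Luis", "Marta"],
    "skills": [["Python", "SQL"], ["JavaScript", "React"], ["Python", "Docker"]],
})

# Wide form: one column per list position (skill_1, skill_2, ...)
wide = pd.DataFrame(demo["skills"].tolist(), index=demo.index)
wide.columns = [f"skill_{i + 1}" for i in wide.columns]
print(wide)

# Long form: one row per (person, skill) pair
long = demo.explode("skills").rename(columns={"skills": "skill"})

# Counting how often each skill appears is a one-liner in long form
skill_counts = long["skill"].value_counts()
print(skill_counts)
```

The wide form suits export to a flat CSV, while the long form is usually the better starting point for grouping or counting.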


## Saving the modified DataFrame to CSV

In [64]:
df.to_csv('data_limpia.csv', index=False)
print("DataFrame saved to 'data_limpia.csv'")

DataFrame saved to 'data_limpia.csv'
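
The exercise statement also asks for descriptive statistics on the final DataFrame, stored in a separate DataFrame. One possible sketch follows. `df_clean` is a hypothetical stand-in built from the first five rows shown earlier, and the names `stats_df` and `stats_by_company` are illustrative. In the notebook itself, the same calls would be applied to `df`.

```python
import pandas as pd

# Stand-in for the cleaned DataFrame: the first five rows shown earlier
df_clean = pd.DataFrame({
    "nombre": ["John Doe", "Jane Smith", "Mike Johnson", "Alice Brown", "Bob White"],
    "edad": [30, 25, 35, 28, 32],
    "salary": [3450, 2800, 4200, 3100, 2300],
    "compania": ["Tech Solutions", "Tech Solutions", "Tech Solutions",
                 "Innovative Apps", "Innovative Apps"],
})

# Overall statistics: one row per statistic, one column per numeric variable
stats_df = df_clean[["edad", "salary"]].agg(
    ["count", "mean", "median", "std", "min", "max"]
)
print(stats_df)

# Salary statistics per company, kept as a regular DataFrame
stats_by_company = (
    df_clean.groupby("compania")["salary"]
    .agg(["count", "mean", "min", "max"])
    .reset_index()
)
print(stats_by_company)
```

Both results are ordinary DataFrames, so they can be written out with `to_csv` in the same way as the cleaned data.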
