# D1(b)
## Estimated resident population (2002-2019)
### [IstatData](https://esploradati.istat.it/databrowser/#/en/dw/categories/IT1,POP,1.0/POP_INTCENSPOP/DCIS_RICPOPRES2011/IT1,164_164_DF_DCIS_RICPOPRES2011_1,1.0)


In [1]:
#!pip install pandas requests requests_cache xmltodict

In [7]:
import pandas as pd
import json
import requests
import xmltodict
from datetime import datetime
import os

Since we are interested in the five-year span 2018-2022 and the data in D1(a) starts in 2019, we retrieve the 2018 figures from another dataset provided by Istat.


In [8]:
# 1 - Explore Data Structure

response = requests.get('https://esploradati.istat.it/SDMXWS/rest/datastructure/IT1/DCIS_RICPOPRES2011/')
print(response.status_code)

if response.status_code == 200:
    content = response.content
    
    if len(content) > 0:
        try:
            xml_data = xmltodict.parse(content)
            json_string_data = json.dumps(xml_data,
                                    allow_nan = True, # This is the default; with
                                                      # allow_nan = False, NaN or
                                                      # Infinity values would raise
                                                      # ValueError: Out of range float
                                                      # values are not JSON compliant
                                    indent = 6) # Indentation can be used for pretty-printing
            # Now you can work with the parsed JSON data
        except Exception as e: # xmltodict raises an ExpatError on malformed XML
            print("Error parsing XML:", e)
    else:
        print("Empty content received.")
else:
    print("Request failed with status code:", response.status_code)

# Uncomment the following line to see the resulting JSON string
# print(json_string_data)
type(json_string_data)

200


str

By querying the API, we will obtain an XML output that includes the `<structure:DimensionList>` tag, which contains the list of dimensions, i.e., the data schema of the dataset. In our case, the dimensions are as follows: `FREQ`, `REF_AREA`, `DATA_TYPE`, `AGE`, `SEX`, and `CITIZENSHIP`.
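
As a minimal sketch of how the dimension IDs could be pulled out of such a response programmatically, the snippet below uses the standard library's `xml.etree.ElementTree` instead of `xmltodict`. The `SAMPLE_XML` string is a hand-written, heavily trimmed stand-in for the real response, and `list_dimension_ids` is a hypothetical helper, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for the datastructure response (assumption:
# the real response is much larger but nests its dimensions in the same tag).
SAMPLE_XML = """
<structure:DataStructure
    xmlns:structure="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure"
    id="DCIS_RICPOPRES2011">
  <structure:DimensionList>
    <structure:Dimension id="FREQ"/>
    <structure:Dimension id="REF_AREA"/>
    <structure:Dimension id="DATA_TYPE"/>
    <structure:Dimension id="AGE"/>
    <structure:Dimension id="SEX"/>
    <structure:Dimension id="CITIZENSHIP"/>
  </structure:DimensionList>
</structure:DataStructure>
"""


def list_dimension_ids(xml_text):
    """Return the IDs of all <Dimension> elements, in document order."""
    root = ET.fromstring(xml_text)
    ids = []
    for element in root.iter():
        # Tags look like "{namespace-uri}Dimension"; compare the local name only
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Dimension":
            ids.append(element.get("id"))
    return ids


dimension_ids = list_dimension_ids(SAMPLE_XML)
print(dimension_ids)
```

Matching on the local tag name keeps the helper independent of the namespace prefix used in the response.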

To understand what these abbreviations mean, we can look at the package called `codelist`, which can be queried with the IDs discovered above. Take `CITIZENSHIP` as an example: reading the XML above, we see that its ID in the `codelist` package is `CL_CITTADINANZA`. Querying the URL [https://esploradati.istat.it/SDMXWS/rest/codelist/IT1/CL_CITTADINANZA](https://esploradati.istat.it/SDMXWS/rest/codelist/IT1/CL_CITTADINANZA) in Postman (some API responses are too long to be loaded in a Jupyter Notebook), we can see that this dimension describes the kind of citizenship. An example of one record is shown below:

```xml
<structure:Code id="FRG">
    <common:Name xml:lang="en">foreign</common:Name>
    <common:Name xml:lang="it">straniero-a</common:Name>
</structure:Code>
```

We are interested in foreign residents, and we find that:
- `ITL` is the ID for `italian`
- `FRG` is the ID for `foreign`
- `EU27` is the ID for `of European Union country with 27 Member States`
- `EU27_NOITL` is the ID for `of European Union country (except Italy)`
- `EXTEU27` is the ID for `of extra European Union country with 27 Member States`
- `APO` is the ID for `apolide` (Italian for stateless)
- `FRGAPO` is the ID for `foreign/stateless`
- `TOTAL` is the ID for `total`

In [9]:
# 2 - Explore the meaning of the dimensions of the dataset

response = requests.get('https://esploradati.istat.it/SDMXWS/rest/codelist/IT1/CL_CITTADINANZA')
print(response.status_code)

if response.status_code == 200:
    content = response.content
    
    if len(content) > 0:
        try:
            xml_data = xmltodict.parse(content)
            json_string_data = json.dumps(xml_data,
                                    allow_nan = True, # This is the default; with
                                                      # allow_nan = False, NaN or
                                                      # Infinity values would raise
                                                      # ValueError: Out of range float
                                                      # values are not JSON compliant
                                    indent = 6) # Indentation can be used for pretty-printing
            # Now you can work with the parsed JSON data
        except Exception as e: # xmltodict raises an ExpatError on malformed XML
            print("Error parsing XML:", e)
    else:
        print("Empty content received.")
else:
    print("Request failed with status code:", response.status_code)

# Uncomment the following line to see the resulting JSON string
# print(json_string_data)

200
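
Rather than reading the codelist by eye, the ID-to-label mapping could be built in code. The following is a rough sketch using the standard library's `xml.etree.ElementTree`; `SAMPLE_XML` is a hand-written excerpt modeled on the record shown earlier, and `codelist_labels` is a hypothetical helper introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this fully qualified name
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written excerpt modeled on the record shown earlier (assumption: the
# real codelist contains many more <structure:Code> entries of the same shape).
SAMPLE_XML = """
<structure:Codelist
    xmlns:structure="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure"
    xmlns:common="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common"
    id="CL_CITTADINANZA">
  <structure:Code id="FRG">
    <common:Name xml:lang="en">foreign</common:Name>
    <common:Name xml:lang="it">straniero-a</common:Name>
  </structure:Code>
  <structure:Code id="ITL">
    <common:Name xml:lang="en">italian</common:Name>
  </structure:Code>
</structure:Codelist>
"""


def local_name(element):
    """Strip the '{namespace-uri}' part from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def codelist_labels(xml_text, lang="en"):
    """Map each code ID to its name in the requested language."""
    root = ET.fromstring(xml_text)
    labels = {}
    for element in root.iter():
        if local_name(element) != "Code":
            continue
        for child in element:
            if local_name(child) == "Name" and child.get(XML_LANG) == lang:
                labels[element.get("id")] = child.text
    return labels


labels_en = codelist_labels(SAMPLE_XML)
labels_it = codelist_labels(SAMPLE_XML, lang="it")
print(labels_en)
```

Codes without a name in the requested language are simply left out of the result, which is why `labels_it` only contains `FRG` for this sample.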


We explore all the other dimensions in the same way. Since we need all ages for our purpose, we have to understand how they are divided and which values are available. To do that, we need to know the ID of our dataset. We can check it in the URL on IstatData, where we see that it is `164_164_DF_DCIS_RICPOPRES2011_1`.
We can query the REST API like this to obtain our result:

In [10]:
# 3 - Explore values in Dimensions

response = requests.get('https://esploradati.istat.it/SDMXWS/rest/availableconstraint/164_164_DF_DCIS_RICPOPRES2011_1')
print(response.status_code)

if response.status_code == 200:
    content = response.content
    
    if len(content) > 0:
        try:
            xml_data = xmltodict.parse(content)
            json_string_data = json.dumps(xml_data,
                                    allow_nan = True, # This is the default; with
                                                      # allow_nan = False, NaN or
                                                      # Infinity values would raise
                                                      # ValueError: Out of range float
                                                      # values are not JSON compliant
                                    indent = 6) # Indentation can be used for pretty-printing
            # Now you can work with the parsed JSON data
        except Exception as e: # xmltodict raises an ExpatError on malformed XML
            print("Error parsing XML:", e)
    else:
        print("Empty content received.")
else:
    print("Request failed with status code:", response.status_code)

# Uncomment the following line to see the resulting JSON string
# print(json_string_data)

200
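
To list the values available for each dimension without scrolling through the whole response, something along these lines could work. It is only a sketch: it assumes the response follows the standard SDMX 2.1 layout, with one `common:KeyValue` element per dimension holding `common:Value` children. `SAMPLE_XML` is a hand-written illustration (its values are not copied from the real response), and `available_values` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written illustration of the assumed layout; the values below are
# examples taken from this notebook, not the actual API response.
SAMPLE_XML = """
<structure:CubeRegion
    xmlns:structure="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure"
    xmlns:common="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common"
    include="true">
  <common:KeyValue id="SEX">
    <common:Value>1</common:Value>
    <common:Value>2</common:Value>
    <common:Value>9</common:Value>
  </common:KeyValue>
  <common:KeyValue id="CITIZENSHIP">
    <common:Value>FRG</common:Value>
    <common:Value>ITL</common:Value>
    <common:Value>TOTAL</common:Value>
  </common:KeyValue>
</structure:CubeRegion>
"""


def available_values(xml_text):
    """Map each dimension ID to the list of values found under its KeyValue."""
    root = ET.fromstring(xml_text)
    values = {}
    for element in root.iter():
        # Compare the local tag name so the namespace prefix does not matter
        if element.tag.rsplit("}", 1)[-1] != "KeyValue":
            continue
        values[element.get("id")] = [
            child.text
            for child in element
            if child.tag.rsplit("}", 1)[-1] == "Value"
        ]
    return values


constraints = available_values(SAMPLE_XML)
print(constraints["SEX"])
```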


We can now compose our final query to retrieve all values relative to the Italian geographical regions for all sexes and all ages. Our final URL will be: `https://esploradati.istat.it/SDMXWS/rest/data/164_164_DF_DCIS_RICPOPRES2011_1/A.IT+ITCD+ITC+ITD+ITE+ITF+ITG.JAN.TOTAL.9.FRG/`

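The key after the dataset ID holds one filter per dimension, separated by dots and in the same order as the dimension list (`FREQ`, `REF_AREA`, `DATA_TYPE`, `AGE`, `SEX`, `CITIZENSHIP`); several values for one dimension are joined with `+`. Among the filters we apply after the `data` request:
- `TOTAL` selects all ages together, without a breakdown by age.
- `9` selects both sexes together, without a breakdown into males and females.
- `FRG` selects foreign citizens.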

Since we are interested in geographical regions, we find that:

- `ITC` is the ID for the Northwest (`nord-ovest` in Italian)
- `ITD` is the ID for the Northeast (`nord-est` in Italian)
- `ITCD` is the ID for the North (`nord` in Italian)
- `ITE` is the ID for the Center (`centro` in Italian)
- `ITF` is the ID for the South (`sud` in Italian)
- `ITG` is the ID for the Islands, including Sardinia and Sicily (`isole` in Italian)
- `IT` is the ID for the whole `Italy` (`Italia` in Italian)

We will need these IDs for our API request for this dimension.
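
The query key can also be assembled from a dictionary of filters instead of being typed by hand. This is a small sketch under that idea; `build_data_url` and the `filters` dictionary are hypothetical names introduced here, and the dictionary order is assumed to match the dimension order of the dataset.

```python
BASE_URL = "https://esploradati.istat.it/SDMXWS/rest/data"


def build_data_url(dataflow_id, filters):
    """Join values of one dimension with '+', and dimensions with '.'."""
    key = ".".join("+".join(values) for values in filters.values())
    return f"{BASE_URL}/{dataflow_id}/{key}/"


# One entry per dimension, in the order the dataset defines them
filters = {
    "FREQ": ["A"],
    "REF_AREA": ["IT", "ITCD", "ITC", "ITD", "ITE", "ITF", "ITG"],
    "DATA_TYPE": ["JAN"],
    "AGE": ["TOTAL"],
    "SEX": ["9"],
    "CITIZENSHIP": ["FRG"],
}

url = build_data_url("164_164_DF_DCIS_RICPOPRES2011_1", filters)
print(url)
```

Keeping the filters in a dictionary makes it easier to change a single dimension, for example the list of areas, without rewriting the whole URL.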

In [11]:
# 4 -  Final query with filters

response = requests.get('https://esploradati.istat.it/SDMXWS/rest/data/164_164_DF_DCIS_RICPOPRES2011_1/A.IT+ITCD+ITC+ITD+ITE+ITF+ITG.JAN.TOTAL.9.FRG/')
print(response.status_code)

if response.status_code == 200:
    content = response.content
    
    if len(content) > 0:
        try:
            xml_data = xmltodict.parse(content)
            json_string_data = json.dumps(xml_data,
                                    allow_nan = True, # This is the default; with
                                                      # allow_nan = False, NaN or
                                                      # Infinity values would raise
                                                      # ValueError: Out of range float
                                                      # values are not JSON compliant
                                    indent = 6) # Indentation can be used for pretty-printing
            # Now you can work with the parsed JSON data
        except Exception as e: # xmltodict raises an ExpatError on malformed XML
            print("Error parsing XML:", e)
    else:
        print("Empty content received.")
else:
    print("Request failed with status code:", response.status_code)

# Printing is disabled in the documentation because the response is too long to be shown here. Uncomment to see it.
# print(json_string_data)

200


Now we create a well-formed JSON string from the response.

The code snippet performs the following tasks:

1. It takes a JSON string called `json_string_data` and creates a nested dictionary, `nested_dict`, using the `json.loads()` function. This step is essential to process and extract information from the JSON data.

2. It defines a translation dictionary, `sex_translation`, which maps the numeric codes to readable labels ('1' to 'Male', '2' to 'Female', and '9' to 'TOTAL'). This dictionary will be used to translate the sex values later.

3. It initializes an empty list, `result`, which will store the extracted information from the nested dictionary.

4. It iterates over the series data in the nested dictionary. Each series represents a set of observations for a specific combination of variables.

5. Within each series, it retrieves the territory and sex values by searching for specific IDs (`REF_AREA` and `SEX`) in the series key. If found, it assigns the corresponding values to the `territory` and `sex` variables, respectively. The sex value is translated using the `sex_translation` dictionary.

6. It retrieves the observations (`obs_values`) for each series and iterates over them. Each observation provides the year and the quantity; the age is read from the series key.

7. It creates an entry dictionary that contains the extracted information, including the territory, year, sex, age, and quantity.

8. The entry dictionary is appended to the `result` list.

9. Finally, the `result` list is converted to a JSON string, `immigrants_distribution_2018`, using `json.dumps()`. The type of the `immigrants_distribution_2018` variable is printed to verify that it is a string.

In summary, this code processes the nested dictionary, extracts specific information, translates values, and organizes the extracted data into a clean JSON format suitable for visualization or further analysis.

In [12]:
# Creating a nested dictionary from the response in order to create a clean JSON for our visualization

nested_dict = json.loads(json_string_data)

# Translation dictionary
sex_translation = {
    '1': 'Male',
    '2': 'Female',
    '9': 'TOTAL'
}

# Extracting information
result = []

series_data = nested_dict['message:GenericData']['message:DataSet']['generic:Series']
for series in series_data:
    series_key = series['generic:SeriesKey']
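    territory = None
    sex = None
    age = None

    # Look up each dimension by its ID rather than by its position in the key
    for value in series_key['generic:Value']:
        if value['@id'] == 'REF_AREA':
            territory = value['@value']
        elif value['@id'] == 'SEX':
            sex_value = value['@value']
            sex = sex_translation.get(sex_value)
        elif value['@id'] == 'AGE':
            age = value['@value']

    obs_values = series['generic:Obs']
    # xmltodict returns a dict instead of a list when there is a single observation
    if isinstance(obs_values, dict):
        obs_values = [obs_values]
    for obs in obs_values:
        year = obs['generic:ObsDimension']['@value']
        quantity = obs['generic:ObsValue']['@value']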

        entry = {
            'Territory': territory,
            'Year': int(year),
            'Sex': sex,
            'Age': age,
            'Quantity': int(quantity)
        }
        result.append(entry)

# Convert result to JSON
immigrants_distribution_2018 = json.dumps(result)

# Uncomment the following line to see the resulting JSON string
# print(immigrants_distribution_2018)
type(immigrants_distribution_2018)

str

In [13]:
# Load the JSON dataset
data = immigrants_distribution_2018

# Ensure data is parsed as a list of dictionaries
if isinstance(data, str):
    data = json.loads(data)

# Use a list comprehension to filter data for the year 2018
data_2018 = [entry for entry in data if entry["Year"] == 2018]

# Specify the folder path to save the JSON file
folder_path = "../_datasets/Clean/D1(b)"

# Create the folder if it doesn't exist
os.makedirs(folder_path, exist_ok=True)

# Define the file path for the output JSON file
filename = "filtered_immigrants_distribution_2018.json"

# Generate the file path
file_path = os.path.join(folder_path, filename)

# Save the filtered data as a JSON file with the same name
with open(file_path, "w") as json_file:
    json.dump(data_2018, json_file, indent=4)

print(f"Filtered data for 2018 saved as {filename}")

Filtered data for 2018 saved as filtered_immigrants_distribution_2018.json


After saving the JSON data for 2018, we merge it with the data for 2019 to 2022 retrieved from the other dataset, D1(a).

In [14]:
# Let's create a single json adding 2018's data with the other years

# Load the JSON file with data for other years
with open("../_datasets/Clean/D1(a)/immigrants_distribution.json", "r") as file:
    data_other_years = json.load(file)

# Combine the data from both files into a single list
combined_data = data_other_years + data_2018

sorted_data = sorted(combined_data, key=lambda x: x['Year'])

# Specify the folder path to save the combined JSON file
folder_path = "../_datasets/Clean/D1(b)"

# Define the file path for the output JSON file
filename = "combined_immigrants_distribution.json"

# Generate the file path
file_path = os.path.join(folder_path, filename)

# Save the combined data as a new JSON file
with open(file_path, "w") as json_file:
    json.dump(sorted_data, json_file, indent=4)

print(f"Combined data saved as {filename}")

Combined data saved as combined_immigrants_distribution.json
