# Dataset Generation
The code below shows how the initial city-country-region dataset is cleaned up and used to generate an artificial dataset of people, along with their connections to each other and to the given cities.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Generate a random list of person IDs and their cities of residence

#### Read in cities, countries and regions dataset

In [2]:
cities = pd.read_csv('city_in_region.csv', header=None, names=['cityID', 'city', 'country', 'region'])
cities.head()

Unnamed: 0,cityID,city,country,region
0,1,Chongqing,China,East Asia
1,2,Shanghai,China,East Asia
2,3,Beijing,China,East Asia
3,4,Karachi,Pakistan,Asia
4,5,Istanbul,Turkey,Middle East


#### Obtain counts of cities grouped by countries

In [3]:
grouped = cities.groupby(['country', 'region']).agg('count').rename(columns={'city': 'country_weight'})
grouped = grouped.reset_index()
grouped.head(5)

Unnamed: 0,country,region,cityID,country_weight
0,Afghanistan,Asia,1,1
1,Argentina,Latin America,1,1
2,Bangladesh,Asia,1,1
3,Brazil,Latin America,2,2
4,Chile,Latin America,1,1


#### Combine original dataset with a "weighted" country value (for sampling)

In [4]:
weighted = pd.merge(cities[['city', 'country', 'region']], grouped[['country', 'country_weight']], on='country', how='left')
weighted.head(10)

Unnamed: 0,city,country,region,country_weight
0,Chongqing,China,East Asia,25
1,Shanghai,China,East Asia,25
2,Beijing,China,East Asia,25
3,Karachi,Pakistan,Asia,3
4,Istanbul,Turkey,Middle East,2
5,Dhaka,Bangladesh,Asia,1
6,Tokyo,Japan,East Asia,2
7,Moscow,Russian Federation,Europe,2
8,Guangzhou,China,East Asia,25
9,Shenzhen,China,East Asia,25
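
The per-country count can also be attached to each row in a single step, without the intermediate `grouped` frame and merge. The snippet below is a minimal sketch of that alternative on a toy frame; `toy_cities` and its rows are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Toy stand-in for the `cities` frame; the rows are illustrative only.
toy_cities = pd.DataFrame({
    'city': ['Shanghai', 'Beijing', 'Karachi', 'Tokyo', 'Osaka', 'Nagoya'],
    'country': ['China', 'China', 'Pakistan', 'Japan', 'Japan', 'Japan'],
})

# Count the rows per country and broadcast that count back onto every row.
toy_cities['country_weight'] = (
    toy_cities.groupby('country')['city'].transform('count')
)

print(toy_cities)
```

Grouping on `country` alone also sidesteps any row duplication that a merge could introduce if a country were ever listed under more than one region.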


#### Generate a weighted sample of 100 cities (with repetition)
100 cities are drawn at random with replacement. Each row's probability of being drawn is proportional to its `country_weight`, so this is weighted random sampling rather than stratified sampling in the strict sense. The aim is to mimic incoming real-world data in which the frequency of a city is correlated with the number of cities listed for its country, rather than letting one single city dominate the sample.

In [5]:
sampled = weighted.sample(n=100, replace=True, weights='country_weight', random_state=37).reset_index()
sampled = sampled.drop(columns=['index', 'region', 'country_weight'])
sampled.head(10)

Unnamed: 0,city,country
0,Pune,India
1,Hong Kong,China
2,Tianjin,China
3,Chennai,India
4,Xian,China
5,Shantou,China
6,Moscow,Russian Federation
7,Alexandria,Egypt
8,Foshan,China
9,Kolkata,India
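
To see what the `weights` argument implies, the draw probabilities can be computed by hand. `DataFrame.sample` normalises the weights so that they sum to 1, which means every row of a country carries that country's count as its weight. The sketch below uses a hypothetical four-city frame (`toy`) to show the resulting per-country probabilities.

```python
import pandas as pd

# Hypothetical frame: country A lists three cities, country B lists one.
toy = pd.DataFrame({
    'city': ['A1', 'A2', 'A3', 'B1'],
    'country': ['A', 'A', 'A', 'B'],
    'country_weight': [3, 3, 3, 1],
})

# Per-row draw probability: the weight divided by the sum of all weights.
toy['row_prob'] = toy['country_weight'] / toy['country_weight'].sum()

# Probability that a single draw lands in each country.
country_prob = toy.groupby('country')['row_prob'].sum()
print(country_prob)
```

Country A holds 75% of the rows but receives 90% of the draw probability, so weighting rows by the country count favours countries with many cities more strongly than uniform sampling would.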


#### Include person IDs for each entry in the sample
Each sampled row is assigned a person ID, so every person is associated with one randomly drawn city.

In [6]:
sampled['personID'] = sampled.index + 1
sampled.head(10)

Unnamed: 0,city,country,personID
0,Pune,India,1
1,Hong Kong,China,2
2,Tianjin,China,3
3,Chennai,India,4
4,Xian,China,5
5,Shantou,China,6
6,Moscow,Russian Federation,7
7,Alexandria,Egypt,8
8,Foshan,China,9
9,Kolkata,India,10


#### Assign random age values in the range 18-45 for each personID
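We assume that a real dataset would cover adults within a specified age range. Because `np.random.randint` excludes its upper bound, `randint(18, 46, ...)` yields ages from 18 to 45 inclusive.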

In [7]:
# Set random seed for numpy
np.random.seed(37)

sampled['age'] = np.random.randint(18, 46, sampled.shape[0])
sampled.tail(10)

Unnamed: 0,city,country,personID,age
90,Ahmedabad,India,91,22
91,Tianjin,China,92,39
92,Guangzhou,China,93,18
93,Hangzhou,China,94,41
94,Chengdu,China,95,31
95,Shenyang,China,96,28
96,Xian,China,97,25
97,Hangzhou,China,98,22
98,Ningbo,China,99,33
99,Shenyang,China,100,19
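
As an alternative to seeding NumPy's global state, a local `Generator` can be used. This is only a sketch of that option: it produces different values from the legacy `np.random.seed` / `np.random.randint` calls above, so it would not reproduce the table shown. The variable names `rng` and `ages` are illustrative.

```python
import numpy as np

# A local generator keeps the seed from affecting other code that uses np.random.
rng = np.random.default_rng(37)

# endpoint=True makes the upper bound inclusive, so the range reads as 18-45.
ages = rng.integers(18, 45, size=100, endpoint=True)

print(ages.min(), ages.max())
```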


## Generate a random list of personIDs and connected persons
This section shows how to generate a random list of connections for each personID in the earlier dataset.

In [8]:
connections = sampled[['personID']].sample(n=1000, replace=True, random_state=37).reset_index().drop(columns=['index'])
connections.head(10)

Unnamed: 0,personID
0,16
1,77
2,93
3,54
4,23
5,36
6,68
7,43
8,86
9,64


In [9]:
np.random.seed(37)

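# NOTE: randint excludes its upper bound and max(sampled.index) is 99, so connectionID
# only takes values 1-98. Using sampled['personID'].max() + 1 as the upper bound would
# cover all 100 persons, but would change the output shown below.
connections['connectionID'] = np.random.randint(1, max(sampled.index), size=connections.shape[0])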
# Sort by personID to read more easily
connections = connections.sort_values(by='personID')
connections = connections.reset_index().drop(columns=['index'])

In [10]:
connections.head(10)

Unnamed: 0,personID,connectionID
0,1,65
1,1,57
2,1,7
3,1,85
4,1,76
5,1,5
6,1,70
7,1,22
8,1,42
9,1,1
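
The range of `connectionID` depends on the upper bound passed to `np.random.randint`, which is exclusive. The sketch below compares the bound used above with one that covers every person ID; `n_people`, `narrow` and `full` are hypothetical names introduced for illustration.

```python
import numpy as np

np.random.seed(37)

n_people = 100                 # assumed population size, matching the 100 sampled rows
highest_index = n_people - 1   # what max(sampled.index) evaluates to for a 0-based index

# Upper bound 99 is exclusive, so IDs 99 and 100 can never be drawn.
narrow = np.random.randint(1, highest_index, size=10_000)

# Upper bound 101 is exclusive, so every ID from 1 to 100 can be drawn.
full = np.random.randint(1, n_people + 1, size=10_000)

print(narrow.min(), narrow.max())
print(full.min(), full.max())
```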


#### Remove rows whose personID equals the connected person's ID
We don't want self-connections in our data!

In [11]:
connections = connections.query('personID != connectionID')
connections.tail(10)

Unnamed: 0,personID,connectionID
990,100,40
991,100,94
992,100,74
993,100,84
994,100,68
995,100,45
996,100,3
997,100,34
998,100,80
999,100,23
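
The same filter can be written as a boolean mask. Since connections are drawn at random, identical (personID, connectionID) pairs may also occur more than once; dropping them is an optional extra step that the original notebook does not perform. A minimal sketch on a hypothetical `toy_connections` frame:

```python
import pandas as pd

# Illustrative connections: two self-connections and one repeated pair.
toy_connections = pd.DataFrame({
    'personID':     [1, 1, 1, 2, 2],
    'connectionID': [1, 7, 7, 2, 5],
})

# Keep only the rows where a person is connected to someone else.
mask = toy_connections['personID'] != toy_connections['connectionID']
no_self = toy_connections[mask]

# Optional: collapse repeated pairs and renumber the rows.
unique_pairs = no_self.drop_duplicates().reset_index(drop=True)
print(unique_pairs)
```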


## Output data to files

#### Reorganize columns and output `person_in_city` to file
The columns are reordered, and the data on persons and their cities of residence is written to its own JSON file.

In [12]:
sampled = sampled[['personID', 'age', 'city', 'country']]
sampled.to_json('person_in_city.json', orient='records')

#### Output cities and regions to file

In [13]:
cities.to_json('city_in_region.json', orient='records')

#### Output persons and connections to file

In [14]:
connections.to_json('person_connections.json', orient='records')
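
To check what `orient='records'` produces, the JSON can be generated in memory and read back. The sketch below uses a hypothetical two-row frame (`toy_people`) rather than the files written above; calling `to_json` without a path returns the JSON as a string.

```python
import io
import json

import pandas as pd

# Illustrative rows in the same column order as `person_in_city`.
toy_people = pd.DataFrame({
    'personID': [1, 2],
    'age': [30, 41],
    'city': ['Pune', 'Hong Kong'],
    'country': ['India', 'China'],
})

# orient='records' serialises the frame as a list of one object per row.
json_text = toy_people.to_json(orient='records')
records = json.loads(json_text)

# Reading the same text back yields the original rows.
restored = pd.read_json(io.StringIO(json_text), orient='records')
print(records[0])
```

The index is not written with this orientation, which is why `personID` is kept as an explicit column.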