# Dataset Generation
The code below cleans up the initial city-country-region dataset and uses it to generate an artificial dataset of people, along with their connections to each other and to the given cities.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Read in cities, countries and regions dataset

In [2]:
cities = pd.read_csv('city_in_region.csv', header=None, names=['city', 'country', 'region'])
cities.head()

Unnamed: 0,city,country,region
0,Chongqing,China,East Asia
1,Shanghai,China,East Asia
2,Beijing,China,East Asia
3,Karachi,Pakistan,Asia
4,Istanbul,Turkey,Middle East


#### Obtain counts of cities grouped by countries

In [3]:
grouped = cities.groupby(['country', 'region']).agg('count').rename(columns={'city': 'country_weight'})
grouped = grouped.reset_index()
grouped.head(5)

Unnamed: 0,country,region,country_weight
0,Afghanistan,Asia,1
1,Argentina,Latin America,1
2,Bangladesh,Asia,1
3,Brazil,Latin America,2
4,Chile,Latin America,1


#### Combine original dataset with a "weighted" country value (for sampling)

In [4]:
weighted = pd.merge(cities[['city', 'country', 'region']], grouped[['country', 'country_weight']], on='country', how='left')
weighted.head(10)

Unnamed: 0,city,country,region,country_weight
0,Chongqing,China,East Asia,25
1,Shanghai,China,East Asia,25
2,Beijing,China,East Asia,25
3,Karachi,Pakistan,Asia,3
4,Istanbul,Turkey,Middle East,2
5,Dhaka,Bangladesh,Asia,1
6,Tokyo,Japan,East Asia,2
7,Moscow,Russian Federation,Europe,2
8,Guangzhou,China,East Asia,25
9,Shenzhen,China,East Asia,25
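The group-then-merge steps above can also be sketched in a single pass with `groupby(...).transform('count')`, which broadcasts each country's city count back onto every row. The toy DataFrame below is hypothetical and only stands in for `city_in_region.csv`; this is an alternative sketch, not the approach used in the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for city_in_region.csv
toy = pd.DataFrame({
    'city': ['Shanghai', 'Beijing', 'Karachi', 'Tokyo'],
    'country': ['China', 'China', 'Pakistan', 'Japan'],
    'region': ['East Asia', 'East Asia', 'Asia', 'East Asia'],
})

# Count the cities per country and attach that count to every row of the country
toy['country_weight'] = toy.groupby('country')['city'].transform('count')
print(toy)
```

Because `transform` keeps the original row order and index, no merge is needed and there is no risk of rows being duplicated by the join.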


#### Generate a weighted sample of 1000 cities (with replacement)
We draw 1,000 cities with replacement. Each row's probability of being drawn is proportional to its `country_weight`, that is, the number of cities its country has in the dataset. Strictly speaking, this is weighted random sampling (what `DataFrame.sample` does when given `weights`) rather than stratified sampling. The aim is to mimic incoming real-world data whose volume is correlated with the number of cities per country, so that no single city is artificially overrepresented.

In [5]:
sampled = weighted.sample(n=1000, replace=True, weights='country_weight', random_state=37).reset_index(drop=True)
sampled = sampled.drop(columns=['region', 'country_weight'])
sampled.head(10)

Unnamed: 0,city,country
0,Pune,India
1,Hong Kong,China
2,Tianjin,China
3,Chennai,India
4,Xian,China
5,Shantou,China
6,Moscow,Russian Federation
7,Alexandria,Egypt
8,Foshan,China
9,Kolkata,India
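To make the effect of the `weights` argument concrete, the following sketch computes the per-row draw probabilities by hand and compares them with an empirical sample. The toy data and the names `draw_probability`, `country_share` and `empirical_share` are assumptions introduced for illustration only.

```python
import pandas as pd

# Hypothetical toy data with precomputed country weights
toy = pd.DataFrame({
    'city': ['Shanghai', 'Beijing', 'Karachi', 'Tokyo'],
    'country': ['China', 'China', 'Pakistan', 'Japan'],
    'country_weight': [2, 2, 1, 1],
})

# DataFrame.sample normalizes the weights to sum to 1,
# so each row's share of the total weight is its draw probability
toy['draw_probability'] = toy['country_weight'] / toy['country_weight'].sum()

# Summing over a country's rows gives the country's expected share of the sample
country_share = toy.groupby('country')['draw_probability'].sum()

# Empirical check: draw many rows with replacement and compare the shares
draws = toy.sample(n=10_000, replace=True, weights='country_weight', random_state=37)
empirical_share = draws['country'].value_counts(normalize=True)

print(country_share)
print(empirical_share)
```

Note that a country with k cities contributes k rows, each with weight k, so its expected share of the sample is proportional to k squared. This is why countries with many cities dominate the sampled output.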


#### Include person IDs for each entry in the sample
Each sampled row is treated as one person living in that city, so every row receives a unique, 1-based `personID`.

In [6]:
sampled['personID'] = sampled.index + 1
sampled.tail(10)

Unnamed: 0,city,country,personID
990,Shenyang,China,991
991,Tianjin,China,992
992,Shenyang,China,993
993,Jaipur,India,994
994,Ningbo,China,995
995,Shantou,China,996
996,Giza,Egypt,997
997,Shantou,China,998
998,Nanjing,China,999
999,Wenzhou,China,1000


#### Assign random age values in the range 18-65 for each personID
We assume that a real dataset would cover adults only, so each person is assigned an integer age drawn uniformly from 18 to 65 inclusive.

In [7]:
# Set random seed for numpy
np.random.seed(37)

sampled['age'] = np.random.randint(18, 66, sampled.shape[0])
sampled.tail(10)

Unnamed: 0,city,country,personID,age
990,Shenyang,China,991,31
991,Tianjin,China,992,25
992,Shenyang,China,993,33
993,Jaipur,India,994,18
994,Ningbo,China,995,42
995,Shantou,China,996,55
996,Giza,Egypt,997,60
997,Shantou,China,998,33
998,Nanjing,China,999,64
999,Wenzhou,China,1000,40
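As a sketch of the same idea using NumPy's newer `Generator` interface, the block below draws ages in the same range. The seed is reused only for symmetry with the cell above; a `Generator` produces a different sequence than the legacy `np.random.seed` / `np.random.randint` pair, so the values will not match the output shown.

```python
import numpy as np

# Seeded generator; the resulting sequence differs from np.random.seed(37)
rng = np.random.default_rng(37)

# The upper bound is exclusive, so values fall in the range 18..65 inclusive
ages = rng.integers(18, 66, size=1000)

print(ages.min(), ages.max())
```

Keeping the generator in a local variable avoids mutating NumPy's global random state, which can matter when several cells rely on reproducible draws.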


#### Reorganize columns and output `person_in_city` as a CSV file
The person records are written to `person_in_city.csv` without an index or header row, in the column order `personID`, `age`, `city`, `country`.

In [8]:
sampled = sampled[['personID', 'age', 'city', 'country']]
# Write without the index column and without a header row
sampled.to_csv('person_in_city.csv', index=False, header=False)
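Because the file is written without a header, anything that reads it back must supply the column names itself. The following minimal sketch shows such a round trip using an in-memory buffer in place of `person_in_city.csv`; the two-row `people` DataFrame is hypothetical sample data.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the generated person data
people = pd.DataFrame({
    'personID': [1, 2],
    'age': [31, 25],
    'city': ['Shenyang', 'Tianjin'],
    'country': ['China', 'China'],
})

# Write without index or header, as in the cell above (a buffer replaces the file)
buffer = io.StringIO()
people.to_csv(buffer, index=False, header=False)
buffer.seek(0)

# Read back, supplying the column names that the file itself does not carry
restored = pd.read_csv(buffer, header=None, names=['personID', 'age', 'city', 'country'])
print(restored)
```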