***
# To be continued tomorrow:
create sentence fragments from columns

In [1]:
import pandas as pd # for data analysis
import gzip         # to work with gzip-compressed files
import spacy        # for NLP (dealing with occupations)
import random

# this changes the settings in your Jupyter Notebook so it displays multiple outputs
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [2]:
# loading from check-point
with gzip.open("../data/processed/working-101718_dataset.dta.gz", "rb") as datafile:
    working_df = pd.read_stata(datafile)


***
Because we won't be using every one of these 130-plus columns, we can start dropping some. <br>
I'll choose which ones to drop based on what I want my Twitter bot to tweet; if you are using this dataset as well, keep whichever variables interest you.

In [3]:
working_df.head().get(['occ_lemma', 'occ_tag', 'occ_nnps'])

# there are 137 columns so it's better to print them out this way
print(list(working_df.columns))

| | occ_lemma | occ_tag | occ_nnps |
|---|---|---|---|
| 0 | | | |
| 1 | | | |
| 2 | retail, salesperson | salesperson | |
| 3 | retail, salesperson | salesperson | |
| 4 | | | |


['datanum', 'serial', 'countyfips', 'city', 'gq', 'farm', 'ownershp', 'ownershpd', 'mortgage', 'mortgag2', 'farmprod', 'acrehous', 'mortamt1', 'mortamt2', 'rent', 'rentgrs', 'rentmeal', 'costelec', 'costgas', 'costwatr', 'costfuel', 'foodstmp', 'lingisol', 'fridge', 'hotwater', 'bedrooms', 'phone', 'cinethh', 'cilaptop', 'cismrtphn', 'citablet', 'ciothcomp', 'cidatapln', 'fuelheat', 'vehicles', 'ssmc', 'nfams', 'nsubfam', 'ncouples', 'multgen', 'multgend', 'pernum', 'perwt', 'sex', 'age', 'marst', 'birthyr', 'marrno', 'yrmarr', 'divinyr', 'widinyr', 'race', 'raced', 'hispan', 'hispand', 'bpl', 'bpld', 'ancestr1', 'ancestr1d', 'ancestr2', 'ancestr2d', 'citizen', 'yrnatur', 'yrimmig', 'yrsusa1', 'language', 'languaged', 'hcovany', 'hinsemp', 'hinscaid', 'hinscare', 'hinsva', 'hinsihs', 'educ', 'educd', 'gradeatt', 'gradeattd', 'schltype', 'degfield', 'degfieldd', 'degfield2', 'degfield2d', 'empstat', 'empstatd', 'labforce', 'occ', 'ind', 'classwkr', 'classwkrd', 'wkswork2', 'uhrswork', '

In [4]:
# We can make a list of variables to drop
income_vars = [col for col in working_df.columns if "inc" in col]

income_vars.remove("incwage") # we want to keep these
income_vars.remove("inctot")

working_df.drop(columns=income_vars, inplace = True)

# Repeat the process for other groups of variables
vet_vars = [col for col in working_df.columns if "vet" in col]

vet_vars.remove("vetstat")

working_df.drop(columns=vet_vars, inplace = True)

# randoms
other_vars = ['lingisol','city','multgend','ind','bpld','uhrswork','yrnatur', 'citizen','yrimmig','availble', 'foodstmp','marrno', 'divinyr', 'widinyr','wkswork2','mortgage', 'degfield', 'rentmeal','gq', 'degfield2','ownershp', 'ownershpd', 'mortgag2', 'farmprod', 'acrehous', 'mortamt1', 'mortamt2', 'rentgrs', 'fridge', 'hotwater', 'bedrooms', 'phone', 'cinethh', 'cilaptop', 'cismrtphn', 'citablet', 'ciothcomp', 'cidatapln', 'fuelheat', 'nfams', 'nsubfam', 'ncouples', 'birthyr', 'raced', 'race', 'hispan', 'hispand', 'ancestr1', 'ancestr2', 'languaged', 'educ', 'gradeatt', 'schltype', 'degfieldd', 'degfield2d', 'empstatd', 'classwkr', 'classwkrd', 'migrate1d', 'movedin']

working_df.drop(columns=other_vars, inplace = True)

# cost
cost_vars = [col for col in working_df.columns if 'cost' in col]

working_df.drop(columns=cost_vars, inplace = True)

# health insurance
health_vars = [col for col in working_df.columns if 'hins' in col]

working_df.drop(columns=health_vars, inplace = True)
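
The pattern in the cell above (collect the columns whose name contains a substring, set aside the ones to keep, drop the rest) repeats for every group. A minimal sketch of a helper that wraps it follows. The name `drop_matching` and the toy frame are hypothetical and not part of the original notebook, and the `incss` and `incwelfr` column names are only illustrative.

```python
import pandas as pd

def drop_matching(df, substring, keep=()):
    """Return a copy of df without the columns whose name contains
    `substring`, except for the columns listed in `keep`."""
    to_drop = [col for col in df.columns if substring in col and col not in keep]
    return df.drop(columns=to_drop)

# toy frame standing in for working_df (illustrative column names)
toy = pd.DataFrame(columns=["incwage", "inctot", "incss", "incwelfr", "age"])

trimmed = drop_matching(toy, "inc", keep=("incwage", "inctot"))
print(list(trimmed.columns))
```

Unlike `list.remove`, the `keep` argument does not raise an error when a listed column is absent, which may or may not be what you want.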

***
You can save this trimmed dataset and start working on building your sentences from it.

In [5]:
with gzip.open("../data/processed/working-15Sep19-cleaned_dataset.dta.gz", "wb") as file:
    working_df.to_stata(file, write_index = False)

## Constructing sentences

Based on the variables left, I came up with 11 different categories.

1. Demographics:
  - countyfips, sex, age, marst, yrmarr
2. Household:
  - farm, rent, vehicles, ssmc, multgen
3. Work:
  - empstat, labforce, occ, looking, pwstate2, occ_lemma, occ_tag, occ_nnps
4. Origin
  - bpl, ancestr1d, ancestr2d, yrsusa1
5. Language
  - language
6. Health coverage:
  - hcovany
7. Veteran
  - vetstat
8. Education
  - educd, gradeattd
9. Money
  - inctot, incwage, poverty
10. Moving
  - migrate1, migplac1
11. Commute
  - tranwork, carpool, riders, trantime, departs, arrives

Based on these categories we can create 11 potential sentence fragments. Of course, not all observations will have all 11 fragments.

Before moving to the code itself it's a good idea to map out the logic for each fragment in ___pseudo-code___:

###### Demographics

`countyfips`, `age`, and `sex` have values for every observation, so we can build a sentence from there. The other variables _may or may not_ have values, depending on whether a person is married (`marst`).

An example sentence:<br>
```python
sentence = f"I'm {age} "
sentence += man_emoji if sex == 'male' else woman_emoji
sentence += f", from {countyfips} county."

if age >= 18:
    if marst == "never married/single":
        sentence += " I'm single."
    elif "married" in marst:
        sentence += f" I got married in {yrmarr}."
    else:
        # first word of marst: divorced, separated, or widowed
        sentence += f" I'm {marst.split()[0]}."
```

So you end up with either <br>
_"I'm 16 {emoji}, from San Diego county"_ or <br>
_"I'm 34 {emoji}, from Alameda county. I got married in 2007."_ or <br>
_"I'm 40 {emoji}, from Los Angeles county. I'm divorced."_

***
emoji unicode from: https://unicode.org/emoji/charts/full-emoji-list.html
***



In [6]:
working_df.head().get('occ_lemma')

0                       
1                       
2    retail, salesperson
3    retail, salesperson
4                       
Name: occ_lemma, dtype: object

In [7]:
boy_emoji = "\U0001F466"
girl_emoji = "\U0001F467"
man_emoji = "\U0001F468"
woman_emoji = "\U0001F469"

In [8]:
# sentence starter
working_df.loc[(working_df['sex'] == 'male'), 'pronounU'] = "He"
working_df.loc[(working_df['sex'] == 'female'), 'pronounU'] = "She"

working_df.loc[(working_df['sex'] == 'male'), 'pronounL'] = "he"
working_df.loc[(working_df['sex'] == 'female'), 'pronounL'] = "she"

working_df['demographics_sentence'] = working_df['pronounU'] + " was only " + working_df['age'].astype(str) + ". "


In [9]:
working_df['demographics_sentence'][:5]

0    She was only 68. 
1     He was only 75. 
2    She was only 50. 
3     He was only 49. 
4     He was only 22. 
Name: demographics_sentence, dtype: object

###### Household
All variables in the Household category are _conditional_.

```python
sentence = ""

if farm == 'farm':
    sentence += "I live on a farm! "

if rent > 0:
    sentence += f"I pay {rent} in rent. "

# vehicles is an ordered category, so ">=" means "at least one vehicle"
if vehicles >= "1 available":
    sentence += f"I have a car available {car_emoji}. "

if ssmc != "households without a same-sex married couple":
    sentence += f"{rainbow_emoji}. "

if multgen in ("2 generations", "3+ generations"):
    sentence += "More than 1 generation lives in my home."
```

In some cases you'll end up with a blank string for `sentence`, but in others you may end up with a five-part sentence: <br>_"I live on a farm! I pay {rent} in rent. I have a car available {car emoji}. {rainbow}. More than 1 generation lives in my home."_

In [10]:
car_emoji = "\U0001F697"
rainbow_emoji = "\U0001F308"

In [11]:
# Starter sentence
working_df['household_sentence'] = ""

# farm
working_df.loc[working_df['farm'] == 'farm', 'household_sentence'] = working_df.loc[working_df['farm'] == 'farm', 'pronounU'] + " lived on a farm. "

# rent
working_df.loc[working_df['rent'] > 0, 'household_sentence'] = working_df.loc[working_df['rent'] > 0, 'household_sentence'] + working_df.loc[working_df['rent'] > 0, 'pronounU'] + " paid $" + working_df.loc[working_df['rent'] > 0, 'rent'].astype(str) + " in rent. "

# car
working_df.loc[working_df['vehicles'] >= '1 available', 'household_sentence'] = working_df.loc[working_df['vehicles'] >= '1 available', 'household_sentence'] + working_df.loc[working_df['vehicles'] >= '1 available', 'pronounU'] + " had a car available. "

# same-sex couples
working_df.loc[working_df['ssmc'] != 'households without a same-sex married couple', 'household_sentence'] = working_df.loc[working_df['ssmc'] != 'households without a same-sex married couple', 'household_sentence'] + rainbow_emoji + " "

# multi-gen households
working_df.loc[working_df['multgen'] >= '2 generations', 'household_sentence'] = working_df.loc[working_df['multgen'] >= '2 generations', 'household_sentence'] + working_df.loc[working_df['multgen'] >= '2 generations', 'pronounU'] + " leaves behind a family. "
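
Each line in the cell above repeats the same boolean mask three or four times, which makes it easy for one copy to drift from the others. A minimal sketch of one way to cut the repetition: name the mask once and pass it to a small helper. `append_where` and the toy frame are hypothetical, and `"non-farm"` is an assumed label used only for illustration.

```python
import pandas as pd

def append_where(df, mask, column, text):
    """Append `text` (a string or an index-aligned Series) to `column`
    on the rows where `mask` is True. Modifies df in place."""
    df.loc[mask, column] = df.loc[mask, column] + text

# toy stand-in for working_df
toy = pd.DataFrame({
    "pronounU": ["She", "He", "She"],
    "farm": ["farm", "non-farm", "farm"],
    "rent": [0, 1200, 850],
})
toy["household_sentence"] = ""

on_farm = toy["farm"] == "farm"
append_where(toy, on_farm, "household_sentence",
             toy.loc[on_farm, "pronounU"] + " lived on a farm. ")

pays_rent = toy["rent"] > 0
append_where(toy, pays_rent, "household_sentence",
             toy.loc[pays_rent, "pronounU"] + " paid $"
             + toy.loc[pays_rent, "rent"].astype(str) + " in rent. ")

print(toy["household_sentence"].tolist())
```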


###### Work
The work sentence is a little more complicated. We used spaCy to create `occ_lemma`, `occ_tag`, and `occ_nnps`, from which we can build a "job title" label. The remaining work variables (`empstat`, `labforce`, `occ`, `looking`, `pwstate2`) supply the other fragments.

```python
if empstat == 'unemployed':
    sentence = "I'm unemployed"
    if looking == 'yes, looked for work':
        sentence += ", but I'm still looking for a job."
    else:
        sentence += "."
elif empstat == 'employed':
    sentence = "I work as {job label from occ_lemma or occ_tag}"
    if pwstate2 != 'n/a' and pwstate2 != 'california':
        sentence += " in {pwstate2}"
    sentence += "."
else:
    sentence = ""
```

So you end up with something like:<br>
_"I'm unemployed, but I'm still looking for a job."_ or <br>
_"I'm unemployed."_ or <br>
_"I work as a scientist in Canada."_

In [12]:
# cleaning up 'occ_lemma'
working_df['occ_lemma'] = working_df['occ_lemma'].str.split(", ,").str[0].str.replace(",","")

In [13]:
working_df['work_sentence'] = ""
working_df['occupation'] = working_df['occ_lemma']

working_df.loc[working_df['empstat'] == 'unemployed', 'work_sentence'] = working_df.loc[working_df['empstat'] == 'unemployed', 'pronounU'] + " was unemployed. "
working_df.loc[((working_df['empstat'] == 'unemployed') & (working_df['looking'] == 'yes, looked for work')), 'work_sentence'] = working_df.loc[((working_df['empstat'] == 'unemployed') & (working_df['looking'] == 'yes, looked for work')),'pronounU'] + " was unemployed, but looking for a job. "

working_df.loc[working_df['occ_lemma'].str[:1] == '(','occupation'] = working_df.loc[working_df['occ_lemma'].str[:1] == '(','occ_lemma'].str[3:-3]
working_df.loc[working_df['occ_tag'].str[:31] == 'miscellaneous, manager, funeral','occupation'] = 'manager'
working_df.loc[working_df['occ_lemma'] == 'property','occupation'] = 'property manager'

# employed
condition_ing = ((working_df['occ_lemma'].str[-3:] == 'ing') & (working_df['occ_lemma'].str.split().str.len() > 0))
condition_er = ((working_df['occ_lemma'].str[-3:] != 'ing') & (working_df['occ_lemma'].str.split().str.len() > 0))

condition_an = working_df['occupation'].str[:1].isin(['a','e','i','o','u'])
condition_a = ~working_df['occupation'].str[:1].isin(['a','e','i','o','u'])


working_df.loc[working_df['occ_lemma'].str[-1:] == 's','occupation'] = working_df.loc[working_df['occ_lemma'].str[-1:] == 's','occ_lemma'].str[:-1]


working_df.loc[condition_an & condition_er, 'occupation'] = 'an '+working_df.loc[condition_an & condition_er, 'occupation']
working_df.loc[condition_a & condition_er, 'occupation'] = 'a '+working_df.loc[condition_a & condition_er, 'occupation']

working_df.loc[condition_ing, 'work_sentence'] = working_df.loc[condition_ing, 'work_sentence'] + working_df.loc[condition_ing, 'pronounU'] + " worked in " + working_df.loc[condition_ing, 'occupation']
working_df.loc[condition_er, 'work_sentence'] = working_df.loc[condition_er, 'work_sentence'] + working_df.loc[condition_er, 'pronounU'] + " was " + working_df.loc[condition_er, 'occupation']

# working somewhere else
works_elsewhere = (working_df['pwstate2'] > "n/a") & (working_df['pwstate2'] <= "mexico") & (working_df['pwstate2'] != 'california')
working_df.loc[works_elsewhere, "work_sentence"] = working_df.loc[works_elsewhere, "work_sentence"] + " in " + working_df.loc[works_elsewhere, "pwstate2"].astype(str)

working_df.loc[working_df['pwstate2'] != "n/a", 'work_sentence'] = working_df.loc[working_df['pwstate2'] != "n/a", 'work_sentence'] + ". "


working_df.query('occ_lemma != ""').get(['work_sentence','occupation','occ_lemma', 'occ_tag', 'occ_nnps']).sample(n=50)




| | work_sentence | occupation | occ_lemma | occ_tag | occ_nnps |
|---|---|---|---|---|---|
| 357349 | She was a preschool and kindergarten teacher. | a preschool and kindergarten teacher | preschool and kindergarten teacher | teacher | |
| 355872 | He was a software developer. | a software developer | software developer | software, developer, application, system, soft... | |
| 159235 | She was a manager. | a manager | miscellaneous manager | miscellaneous, manager, funeral, service, mana... | |
| 75277 | She was a firstline supervisors of retail sale... | a firstline supervisors of retail sale worker | firstline supervisors of retail sale worker | supervisors, sale, worker | supervisors |
| 360764 | He was a firstline supervisors of housekeeping... | a firstline supervisors of housekeeping and ja... | firstline supervisors of housekeeping and jani... | supervisors, workers | supervisors, workers |
| 60636 | He was an information security analyst. | an information security analyst | information security analyst | information, security, analyst | |
| 18933 | She was a retail salesperson. | a retail salesperson | retail salesperson | salesperson | |
| 258860 | He was a cashier. | a cashier | cashier | cashier | |
| 675 | He was an interviewer. | an interviewer | interviewer | interviewer | |
| 257414 | She was a food service manager. | a food service manager | food service manager | food, service, manager | |
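
The a/an choice in the cell above keys on the first letter of the occupation. As a row-level illustration, the same rule can be sketched as a small function. `with_article` is a hypothetical name, and the rule is deliberately naive: it looks at spelling, not pronunciation, so a title such as "hour clerk" would get "a".

```python
VOWELS = ("a", "e", "i", "o", "u")

def with_article(job_title):
    """Prefix a job title with 'a' or 'an' based on its first letter.
    Returns an empty string when there is no job title."""
    if not job_title:
        return ""
    article = "an" if job_title[0].lower() in VOWELS else "a"
    return f"{article} {job_title}"

for job in ["cashier", "interviewer", "information security analyst"]:
    print("She was " + with_article(job) + ".")
```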


###### Origin
The origin sentence is a little more straightforward.

```python
sentence = "I was born in {bpl}."
if ancestr1d not in ("not classified", "other", "not reported"):
    sentence += " I am {ancestr1d}"
    if ancestr2d not in ("not classified", "other", "not reported"):
        sentence += " and {ancestr2d}"
    sentence += "."
```

In [14]:
# sentence starter
working_df["origin_sentence"] = working_df['pronounU'] + " was born in "

# birthplace (drop any parenthetical qualifier, then title-case the label)
working_df["origin_sentence"] = working_df['origin_sentence'] + working_df['bpl'].astype(str).str.split("(").str[0].str.strip().str.title() + ". "

# a little clean up: the labels are already title-cased at this point, so the
# patterns must use that casing; regex=False treats "." and "/" literally
bpl_fixes = {
    "Europe, Ns": "Europe",
    "Other Ussr/Russia": "USSR",
    "Asia, Nec/Ns": "Asia",
    "Yemen Arab Republic": "Yemen",
    "Other N.E.C": "",
    "United Kingdom, Ns": "UK",
    "Americas, N.S": "",
}
for old, new in bpl_fixes.items():
    working_df["origin_sentence"] = working_df['origin_sentence'].str.replace(old, new, regex=False)

# ancestry
working_df.loc[working_df['ancestr1d'] < "united states", 'origin_sentence'] = working_df.loc[working_df['ancestr1d'] < "united states", 'origin_sentence'] + working_df.loc[working_df['ancestr1d'] < "united states", 'pronounU'] + " was " + working_df.loc[working_df['ancestr1d'] < "united states", 'ancestr1d'].astype(str).str.split("(").str[0].str.title().str.strip()
working_df.loc[working_df['ancestr2d'] < "united states", 'origin_sentence'] = working_df.loc[working_df['ancestr2d'] < "united states", 'origin_sentence'] + " and " + working_df.loc[working_df['ancestr2d'] < "united states", 'ancestr2d'].astype(str).str.split("(").str[0].str.title().str.strip()
working_df.loc[working_df['ancestr1d'] < "united states", 'origin_sentence'] = working_df.loc[working_df['ancestr1d'] < "united states", 'origin_sentence'] + ". "
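
The chain `.str.split("(").str[0].str.title().str.strip()` used above is what turns a raw label into something tweetable. A minimal sketch of the same steps on a single string, with the hypothetical helper name `clean_label`:

```python
def clean_label(label):
    """Drop a trailing parenthetical qualifier, trim whitespace,
    and title-case what is left."""
    return label.split("(")[0].strip().title()

print(clean_label("african-american (1990-2000, acs, prcs)"))
print(clean_label("taiwanese"))
```

Note that `str.title()` capitalizes after every non-letter, which is why the hyphenated label comes out as "African-American".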


###### Language, health coverage and veteran status
These are simple, straightforward sentences:
```python
sentence = ""

if language != "other or not reported":
    sentence += "I speak {language} at home. "

if hcovany == "with health insurance coverage":
    sentence += "I have health insurance. "
else:
    sentence += "I don't have health insurance. "

if vetstat == "veteran":
    sentence += "I am a veteran."
```

In [15]:
usa_flag_emoji = '\U0001F1FA\U0001F1F8'

In [16]:
# starter sentence
working_df['lhv_sentence'] = ""

# language
working_df.loc[(working_df['language'] > 'n/a or blank') & (working_df['language'] < 'other or not reported'), 'lhv_sentence'] = working_df.loc[(working_df['language'] > 'n/a or blank') & (working_df['language'] < 'other or not reported'), 'lhv_sentence'] + working_df.loc[(working_df['language'] > 'n/a or blank') & (working_df['language'] < 'other or not reported'), 'pronounU'] + " spoke " + working_df.loc[(working_df['language'] > 'n/a or blank') & (working_df['language'] < 'other or not reported'), 'language'].astype(str) + " at home. "

# health insurance
working_df.loc[working_df['hcovany'] == 'with health insurance coverage', 'lhv_sentence'] = working_df.loc[working_df['hcovany'] == 'with health insurance coverage', 'lhv_sentence'] + working_df.loc[working_df['hcovany'] == 'with health insurance coverage', 'pronounU'] + " had health insurance. "

# veteran
working_df.loc[working_df['vetstat'] == 'veteran', 'lhv_sentence'] = working_df.loc[working_df['vetstat'] == 'veteran', 'lhv_sentence'] + working_df.loc[working_df['vetstat'] == 'veteran', 'pronounU'] + " was a veteran " + usa_flag_emoji + ". "



###### Education
```python
if age < 18:
    sentence = "I am in {gradeattd}."
```

In [17]:
# starter sentence
working_df['education_sentence'] = ""

# use one mask on both sides so the rows line up
minor_in_school = (working_df['age'] < '18') & (working_df['gradeattd'] != 'n/a')
working_df.loc[minor_in_school, 'education_sentence'] = working_df.loc[minor_in_school, 'education_sentence'] + "I am in " + working_df.loc[minor_in_school, 'gradeattd'].astype(str) + ". "

###### Money
The money sentence is pretty straightforward as well:
```python
if poverty <= 100:
    sentence = "I live under the poverty line."
else:
    if incwage > 36000:
        sentence = "I make {incwage} in wages."
    else:
        sentence = "I make {inctot} a year."
```

In [18]:
# starter sentence
working_df['money_sentence'] = ""

working_df.loc[working_df['poverty'] < 100, 'money_sentence'] = working_df.loc[working_df['poverty'] < 100, 'money_sentence'] + working_df.loc[working_df['poverty'] < 100, 'pronounU'] + " lived under the poverty line. "

# wages above 36000 (values of 999998 or more are excluded)
high_wage = (working_df['incwage'] > 36000) & (working_df['incwage'] < 999998)
working_df.loc[high_wage, 'money_sentence'] = working_df.loc[high_wage, 'money_sentence'] + working_df.loc[high_wage, 'pronounU'] + " made $" + working_df.loc[high_wage, 'incwage'].astype(str) + " in wages. "

# otherwise fall back to total income
low_wage = (working_df['incwage'] <= 36000) & (working_df['inctot'] < 999998) & (working_df['inctot'] > 1)
working_df.loc[low_wage, 'money_sentence'] = working_df.loc[low_wage, 'money_sentence'] + working_df.loc[low_wage, 'pronounU'] + " made $" + working_df.loc[low_wage, 'inctot'].astype(str) + " a year. "


###### Moving
For the moving sentence, we'll keep it simple. 
```python
if migrate1 == "moved between states":
    sentence = "I moved from {migplac1} last year."
elif migrate1 == "abroad one year ago":
    if len(migplac1.split()) == 1: # if migplac1 is a one word country to keep tweets short
        sentence = "I moved from {migplac1} last year."
    else:
        pass
else:
    sentence = ""
```

In [19]:
# starter sentence
working_df['moving_sentence'] = ""

# the trailing space keeps this fragment from running into the next one
moved_states = working_df['migrate1'] == 'moved between states'
working_df.loc[moved_states, 'moving_sentence'] = working_df.loc[moved_states, 'moving_sentence'] + working_df.loc[moved_states, 'pronounU'] + " moved from " + working_df.loc[moved_states, 'migplac1'].astype(str).str.title() + " last year. "

moved_abroad = (working_df['migrate1'] == 'abroad one year ago') & (working_df['migplac1'].str.split().str.len() == 1)
working_df.loc[moved_abroad, 'moving_sentence'] = working_df.loc[moved_abroad, 'moving_sentence'] + working_df.loc[moved_abroad, 'pronounU'] + " moved from " + working_df.loc[moved_abroad, 'migplac1'].astype(str).str.title() + " last year. "


###### Commute
Commute is a little more complex because `carpool` depends on `tranwork` and `riders` depends on `carpool`. For `trantime` we could just say how many minutes each person commutes, if they commute, or find something interesting in the `departs` and `arrives` variable. 

```python
if tranwork == "auto, truck, or van":
    if carpool == 'carpools':
        if riders > '2 people':
            sentence = "I carpool with {riders}"
        else:
            sentence = ""
    else:
        sentence = "I drive alone"
elif tranwork == "bus or trolley bus":
    sentence = "I take the bus"
elif tranwork == "subway or elevated":
    sentence = "I take the subway"
elif tranwork == "ferryboat":
    sentence = "I take a ferry"
elif tranwork == "bicycle":
    sentence = "I bike"
elif tranwork == "railroad":
    sentence = "I take the train"
else:
    sentence = ""
    
if sentence != "":
    if trantime > 0:
        sentence += " for {trantime} minutes to work. I usually get there by {arrives}."
    else:
        sentence += "."
```

In [20]:
# fixing arrives and departs
working_df.loc[working_df['trantime'] > 0, 'arrives'] = working_df[working_df['trantime'] > 0]['arrives'].astype(str).str.zfill(4).str[:2] + ":" + working_df[working_df['trantime'] > 0]['arrives'].astype(str).str.zfill(4).str[2:]

working_df.loc[working_df['trantime'] > 0,'departs'] = working_df[working_df['trantime'] > 0]['departs'].astype(str).str.zfill(4).str[:2] + ":" + working_df[working_df['trantime'] > 0]['departs'].astype(str).str.zfill(4).str[2:]
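
The `zfill(4)` step above suggests that `arrives` and `departs` are stored as integers in 24-hour HHMM form (730 for 07:30); that reading is an assumption based on the code, not something the notebook states. Under that assumption, the conversion for a single value can be sketched as follows, with the hypothetical helper name `format_hhmm`:

```python
def format_hhmm(value):
    """Turn a clock time stored as an integer (e.g. 730) into 'HH:MM'."""
    digits = str(int(value)).zfill(4)   # 730 -> "0730"
    return f"{digits[:2]}:{digits[2:]}"

for raw in (5, 730, 1545):
    print(raw, "->", format_hhmm(raw))
```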

In [21]:
# starter sentence
working_df['commute_sentence'] = ""

# carpool
working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] == 'carpools'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] == 'carpools'), 'commute_sentence'] + working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] == 'carpools'), 'pronounU'] + " carpooled with a few people "
working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] != 'carpools'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] != 'carpools'), 'commute_sentence'] + working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] != 'carpools'), 'pronounU'] + " drove "

# other ways
working_df.loc[(working_df['tranwork'] == 'bus or trolley bus'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'bus or trolley bus'), 'commute_sentence'] + working_df.loc[(working_df['tranwork'] == 'bus or trolley bus'), 'pronounU'] + " took the bus "
working_df.loc[(working_df['tranwork'] == 'subway or elevated'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'subway or elevated'), 'commute_sentence'] + working_df.loc[(working_df['tranwork'] == 'subway or elevated'), 'pronounU'] + " took the subway "
working_df.loc[(working_df['tranwork'] == 'ferryboat'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'ferryboat'), 'commute_sentence'] + working_df.loc[(working_df['tranwork'] == 'ferryboat'), 'pronounU'] + " took a ferry "
working_df.loc[(working_df['tranwork'] == 'bicycle'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'bicycle'), 'commute_sentence'] + working_df.loc[(working_df['tranwork'] == 'bicycle'), 'pronounU'] + " biked "
working_df.loc[(working_df['tranwork'] == 'railroad'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'railroad'), 'commute_sentence'] + working_df.loc[(working_df['tranwork'] == 'railroad'), 'pronounU'] + " took the train "

# how long
working_df.loc[(working_df['commute_sentence'] != "") & (working_df['trantime'] > 0), "commute_sentence"] = working_df.loc[(working_df['commute_sentence'] != "") & (working_df['trantime'] > 0), "commute_sentence"] + "for " + working_df.loc[(working_df['commute_sentence'] != "") & (working_df['trantime'] > 0), "trantime"].astype(str) + " minutes to work. " + working_df.loc[(working_df['commute_sentence'] != "") & (working_df['trantime'] > 0), "pronounU"] + " usually got there by " + working_df.loc[(working_df['commute_sentence'] != "") & (working_df['trantime'] > 0), "arrives"].astype(str) + '. '
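
The "other ways" lines above all follow one template, so a lookup table is a possible alternative to one `.loc` line per transport mode. The sketch below is an illustration on toy data, not the notebook's method: `COMMUTE_PHRASES` and `toy` are hypothetical names, and `"n/a"` is an assumed label for people who do not commute. If `tranwork` is categorical in your frame, you may need `.astype(str)` before `.map`.

```python
import pandas as pd

# transport mode -> verb phrase, taken from the strings used above
COMMUTE_PHRASES = {
    "bus or trolley bus": "took the bus",
    "subway or elevated": "took the subway",
    "ferryboat": "took a ferry",
    "bicycle": "biked",
    "railroad": "took the train",
}

toy = pd.DataFrame({
    "pronounU": ["She", "He", "He"],
    "tranwork": ["bicycle", "railroad", "n/a"],
})

# modes missing from the table map to NaN, which fillna turns into ""
phrase = toy["tranwork"].map(COMMUTE_PHRASES)
toy["commute_sentence"] = (toy["pronounU"] + " " + phrase + " ").fillna("")

print(toy["commute_sentence"].tolist())
```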

***
Now we have 9 sentences to put together (language, health coverage, and veteran are grouped together).

In [22]:
sentence_columns = [col for col in working_df.columns if '_sentence' in col]
sentence_columns

['demographics_sentence',
 'household_sentence',
 'work_sentence',
 'origin_sentence',
 'lhv_sentence',
 'education_sentence',
 'money_sentence',
 'moving_sentence',
 'commute_sentence']

We will construct our final sentence by simply concatenating these. Some people will have some sentences and others won't so the `final_sentence` length will vary. We just need to make sure it stays under 280 characters to be able to tweet it out.<br>
We'll save these in a different dataset to keep the sizes manageable (on GitHub at least; our `working` dataset is already almost 25 MB).

In [23]:
losangeles_df = working_df[(working_df.countyfips != 'Los Angeles') & (working_df.age >= '18')].copy()
sentence_df = losangeles_df[sentence_columns].copy()

sentence_df.shape
sentence_df.head()

(215052, 9)

| | demographics_sentence | household_sentence | work_sentence | origin_sentence | lhv_sentence | education_sentence | money_sentence | moving_sentence | commute_sentence |
|---|---|---|---|---|---|---|---|---|---|
| 0 | She was only 68. | She had a car available. | | She was born in China. She was Taiwanese. | She spoke chinese at home. She had health insu... | | She made \$20000 a year. | | |
| 1 | He was only 75. | He had a car available. | | He was born in Philippines. He was Filipino. | He spoke filipino, tagalog at home. He had hea... | | He made \$34000 a year. | | |
| 2 | She was only 50. | She had a car available. She leaves behind a f... | She was a retail salesperson. | She was born in California. She was African-Am... | She spoke english at home. She had health insu... | | She made \$14300 a year. | | She drove for 22 minutes to work. She usually ... |
| 3 | He was only 49. | He had a car available. He leaves behind a fam... | He was a retail salesperson. | He was born in California. He was White/Caucas... | He spoke english at home. He had health insura... | | He made \$15700 a year. | | He drove for 141 minutes to work. He usually g... |
| 4 | He was only 22. | He had a car available. He leaves behind a fam... | He was unemployed, but looking for a job. | He was born in California. He was African-Amer... | He spoke english at home. He had health insura... | | He made \$11500 a year. | | |


In [24]:
sentence_df['final_sentence'] = sentence_df['demographics_sentence'] + sentence_df['household_sentence'] + sentence_df['work_sentence'] + sentence_df['origin_sentence'] + sentence_df['lhv_sentence'] + sentence_df['education_sentence'] + sentence_df['money_sentence'] + sentence_df['moving_sentence'] + sentence_df['commute_sentence']
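
The cells shown here do not enforce the 280-character limit mentioned above. A minimal sketch of one way to check it before writing the file, using a toy frame in place of `sentence_df` (`TWEET_LIMIT`, `toy`, and `tweetable` are hypothetical names):

```python
import pandas as pd

TWEET_LIMIT = 280  # limit quoted in the text above

# toy stand-in for sentence_df
toy = pd.DataFrame({
    "final_sentence": [
        "She was only 68. She had a car available. ",
        "He was only 49. " + "He had a car available. " * 20,  # deliberately too long
    ]
})

lengths = toy["final_sentence"].str.len()
tweetable = toy[lengths <= TWEET_LIMIT]

print(f"{len(tweetable)} of {len(toy)} sentences fit in a tweet")
```

`str.len()` counts Python characters. Twitter's own counting may weight some characters, such as emoji, differently, so treat this as an approximation.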

In [25]:
losangeles_df.head(30).get(['ancestr1d','ancestr2d','work_sentence','occ_lemma', 'occ_tag', 'occ_nnps'])



| | ancestr1d | ancestr2d | work_sentence | occ_lemma | occ_tag | occ_nnps |
|---|---|---|---|---|---|---|
| 0 | taiwanese | not reported | | | | |
| 1 | filipino | not reported | | | | |
| 2 | african-american (1990-2000, acs, prcs) | not reported | She was a retail salesperson. | retail salesperson | salesperson | |
| 3 | white/caucasian (1990-2000, acs, prcs) | not reported | He was a retail salesperson. | retail salesperson | salesperson | |
| 4 | african-american (1990-2000, acs, prcs) | not reported | He was unemployed, but looking for a job. | | | |
| 7 | not reported | not reported | He was a miscellaneous production worker. | miscellaneous production worker | miscellaneous, production, worker, semiconduct... | |
| 8 | not reported | not reported | | | | |
| 11 | vietnamese | not reported | He was a manager. | miscellaneous manager | miscellaneous, manager, funeral, service, mana... | |
| 12 | not reported | not reported | She was a hairdresser. | hairdressers | hairdressers, hairstylists, cosmetologist | hairdressers, hairstylists |
| 16 | not reported | not reported | | | | |


In [26]:
# Saving the lines as a text for twitterbot

with open("../data/processed/working-V1-obituaries.txt", 'w', encoding='utf-8') as file:
    sentences = list(sentence_df['final_sentence'])
    random.shuffle(sentences)
    text = "\n".join(sentences)
    file.write(text)

52342839