***
# To be continued tomorrow:
create sentence fragments from columns

In [2]:
import pandas as pd # for data analysis
import gzip         # to work with zip files 
import spacy        # for NLP (dealing with occupations)

# this changes the settings in your Jupyter Notebook so it displays multiple outputs
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [3]:
# loading from check-point
with gzip.open("../data/processed/working-101718_dataset.dta.gz", "rb") as datafile:
    working_df = pd.read_stata(datafile)
    
working_df.head()

Unnamed: 0,datanum,serial,countyfips,city,gq,farm,ownershp,ownershpd,mortgage,mortgag2,...,pwstate2,tranwork,carpool,riders,trantime,departs,arrives,occ_lemma,occ_tag,occ_nnps
0,1,67752,Tulare,not in identifiable city (or size group),households under 1970 definition,non-farm,owned or being bought (loan),owned free and clear,"no, owned free and clear",,...,,,,,0,0,0,,,
1,1,67752,Tulare,not in identifiable city (or size group),households under 1970 definition,non-farm,owned or being bought (loan),owned free and clear,"no, owned free and clear",,...,,,,,0,0,0,,,
2,1,67753,Riverside,not in identifiable city (or size group),households under 1970 definition,non-farm,owned or being bought (loan),owned with mortgage or loan,"yes, mortgaged/ deed of trust or similar debt",no,...,california,"auto, truck, or van",drives alone,drives alone,22,1105,1124,"retail, salesperson",salesperson,
3,1,67753,Riverside,not in identifiable city (or size group),households under 1970 definition,non-farm,owned or being bought (loan),owned with mortgage or loan,"yes, mortgaged/ deed of trust or similar debt",no,...,california,"auto, truck, or van",drives alone,drives alone,141,502,704,"retail, salesperson",salesperson,
4,1,67753,Riverside,not in identifiable city (or size group),households under 1970 definition,non-farm,owned or being bought (loan),owned with mortgage or loan,"yes, mortgaged/ deed of trust or similar debt",no,...,,,,,0,0,0,,,


***
Because we will not be using every single one of these 130 columns we can start dropping some. <br>
The following I'll choose based on what I want my twitterbot to tweet, you may choose to keep whatever variable you're interested in if you are going to be using this dataset as well.

In [4]:
# We can make a list of variables to drop
income_vars = [col for col in working_df.columns if "inc" in col]

income_vars.remove("incwage") # we want to keep these
income_vars.remove("inctot")

working_df.drop(columns=income_vars, inplace = True)

# Repeat the process for other groups of variables
vet_vars = [col for col in working_df.columns if "vet" in col]

vet_vars.remove("vetstat")

working_df.drop(columns=vet_vars, inplace = True)

# randoms
other_vars = ['lingisol','city','multgend','ind','bpld','uhrswork','yrnatur', 'citizen','yrimmig','availble', 'foodstmp','marrno', 'divinyr', 'widinyr','wkswork2','mortgage', 'degfield', 'rentmeal','gq', 'degfield2','ownershp', 'ownershpd', 'mortgag2', 'farmprod', 'acrehous', 'mortamt1', 'mortamt2', 'rentgrs', 'fridge', 'hotwater', 'bedrooms', 'phone', 'cinethh', 'cilaptop', 'cismrtphn', 'citablet', 'ciothcomp', 'cidatapln', 'fuelheat', 'nfams', 'nsubfam', 'ncouples', 'birthyr', 'raced', 'race', 'hispan', 'hispand', 'ancestr1', 'ancestr2', 'languaged', 'educ', 'gradeatt', 'schltype', 'degfieldd', 'degfield2d', 'empstatd', 'classwkr', 'classwkrd', 'migrate1d', 'movedin']

working_df.drop(columns=other_vars, inplace = True)

# cost
cost_vars = [col for col in working_df.columns if 'cost' in col]

working_df.drop(columns=cost_vars, inplace = True)

# health insurance
health_vars = [col for col in working_df.columns if 'hins' in col]

working_df.drop(columns=health_vars, inplace = True)

***
You can save this trimmed dataset and start working on building your sentences from it.

In [131]:
with gzip.open("../data/processed/working-101818-cleaned_dataset.dta.gz", "wb") as file:
    working_df.to_stata(file, write_index = False)

## Constructing sentences

Based on the variables left I put came up with 11 different categories.

1. Demographics:
  - countyfips, sex, age, marst, yrmarr,
2. Household:
  - farm, rent, vehicles, ssmc, multgen
3. Work:
  - empstat, labforce, occ, looking, pwstate2, occ_lemma, occ_tag, occ_nnps
4. Origin
  - bpl, ancestr1d, ancestr2d, yrsusa1
5. Language
  - language
6. Health coverage:
  - hcovany
7. Veteran
  - vetstat
8. Education
  - educd, gradeattd
9. Money
  - inctot, incwage, poverty
10. Moving
  - migrate1, migplac1
11. Commute
  - tranwork, carpool, riders, trantime, departs, arrives

Based on these categories we can create 11 potential sentence fragments. Of course, not all observations will have all 11 fragments.

Before moving to the code itself it's a good idea to map out the logic for each fragment in ___pseudo-code___:

###### Demographics

`countyfips`, `age`, and `sex` are values we can expect from every observation so we can build a sentence from there. The other variables _could or could not_ have values depending on whether a person is married or not (`marst`).

An example sentence:<br>
```python
sentence = "I'm {age}, from {countyfips}"
if sex == 'male':
    sentence += man emoji
else:
    sentence += woman emoji

if age >= 18:
    if marst == "never married/single":
        sentence = sentence + ". I'm single"
    elif "married" in marst:
        sentence += "I got married in {yrmarr}"
    else:
        sentence += first word of marst ## divorced, separated, or widowed.
else:
    pass
```

So you end up with either <br>
_"I'm 16 {emoji}, from San Diego county"_ or <br>
_"I'm 34 {emoji}, from Alameda county. I got married in 2007."_ or <br>
_"I'm 40 {emoji}, from Los Angeles county. I'm divorced."_

***
emoji unicode from: https://unicode.org/emoji/charts/full-emoji-list.html
***

In [58]:
boy_emoji = "\U0001F466"
girl_emoji = "\U0001F467"
man_emoji = "\U0001F468"
woman_emoji = "\U0001F469"


In [63]:
# sentence starter
working_df['demographics_sentence'] = "I am"
working_df['demographics_sentence'] = working_df['demographics_sentence'] + " " + working_df['age'].astype(str) + ', from ' + working_df['countyfips'].astype(str).str.title() 

# emojis
working_df.loc[((working_df['sex'] == 'male') & (working_df['age'] < '18')), 'demographics_sentence'] = working_df.loc[((working_df['sex'] == 'male') & (working_df['age'] < '18')), 'demographics_sentence'] + " " + boy_emoji
working_df.loc[((working_df['sex'] == 'male') & (working_df['age'] >= '18')), 'demographics_sentence'] = working_df.loc[((working_df['sex'] == 'male') & (working_df['age'] >= '18')), 'demographics_sentence'] + " " + man_emoji

working_df.loc[((working_df['sex'] == 'female') & (working_df['age'] < '18')), 'demographics_sentence'] = working_df.loc[((working_df['sex'] == 'female') & (working_df['age'] < '18')), 'demographics_sentence'] + " " + girl_emoji
working_df.loc[((working_df['sex'] == 'female') & (working_df['age'] >= '18')), 'demographics_sentence'] = working_df.loc[((working_df['sex'] == 'female') & (working_df['age'] >= '18')), 'demographics_sentence'] + " " + woman_emoji

In [64]:
working_df['demographics_sentence'][:5]

0       I am 68, from Tulare ðŸ‘©
1       I am 75, from Tulare ðŸ‘¨
2    I am 50, from Riverside ðŸ‘©
3    I am 49, from Riverside ðŸ‘¨
4    I am 22, from Riverside ðŸ‘¨
Name: demographics_sentence, dtype: object

###### Household
All variables in the Household category are _conditional_

```python
if farm == 'farm':
    sentence = 'I live in a farm!'
else:
    sentence = ""

if rent >= 0:
    sentence += "I pay {rent} in rent."
else:
    sentence += ""
    
if vehicles > "1 available":
    sentence += "I have a car available {car emoji}"
else:
    sentence += ""
    
if ssmc != "households without a same-sex married couple":
    sentence += "{rainbow emoji}"
else:
    sentence += ""
    
if multgen == "2 generations" | "3+ generations":
    sentence += "more than 1 generation lives in my home."
else:
    sentence += ""
```

In some cases you'll end up with a blank string for sentence but in others you may potentially end up with a 4 part sentence: <br>_"I live in a farm! I pay {rent} in rent. I have a car available {car emoji}. {rainbow}. More than 1 generation lives in my home."_

In [79]:
car_emoji = "\U0001F697"
rainbow_emoji = "\U0001F308"

In [86]:
# Starter sentence
working_df['household_sentence'] = ""

# farm
working_df.loc[working_df['farm'] == 'farm', "household_sentence"] = "I live in a farm! "

# rent
working_df.loc[working_df['rent'] > 0, 'household_sentence'] = working_df.loc[working_df['rent'] > 0, 'household_sentence'] + "I pay $" + working_df.loc[working_df['rent'] > 0, 'rent'].astype(str) + " in rent. "

# car
working_df.loc[working_df['vehicles'] >= '1 available', 'household_sentence'] = working_df.loc[working_df['vehicles'] >= '1 available', 'household_sentence'] + "I have a car available " + car_emoji + ". "

# same-sex couples
working_df.loc[working_df['ssmc'] != 'households without a same-sex married couple', 'household_sentence'] = working_df.loc[working_df['ssmc'] != 'households without a same-sex married couple', 'household_sentence'] + rainbow_emoji + " "

# multi-gen households
working_df.loc[working_df['multgen'] >= '2 generations', 'household_sentence'] = working_df.loc[working_df['multgen'] >= '2 generations', 'household_sentence'] + "More than 1 generation lives in my home. "


###### Work
The work sentence is a little more complicated. We used spacy to create `occ_lemma`, `occ_tag`, and `occ_nnps`. From which we can create a "job title" label but for the rest we can create other fragments.
foodstmp, empstat, labforce, occ, uhrswork, looking, availble, pwstate2, occ_lemma, occ_tag, occ_nnps

```python
if empstat == 'unemployed':
    sentence = "I'm unemployed"
    if looking == 'yes, looked for work':
        sentence += ", but I'm still looking for a job."
    else:
        sentence += "."
elif empstat == 'employed':
    sentence = "I work as {job label from occ_lemma or occ_tag}"
    if pwstate2 != 'n/a' & pwstate2 != 'california':
        sentence += " in {pwstate2}."
    else:
        pass
else:
    sentence = ""
    
```

So you end up with something like:<br>
_"I'm unemployed, but I'm still looking for a job."_ or <br>
_"I'm unemployed."_ or <br>
_"I work as a scientist in Canada."_

In [90]:
# cleaning up 'occ_lemma'
working_df['occ_lemma'] = working_df['occ_lemma'].str.split(", ,").str[0].str.replace(",","")

In [101]:
# starter sentence
working_df['work_sentence'] = ""

# unemployed
working_df.loc[(working_df['empstat'] == 'unemployed'), 'work_sentence'] = working_df['work_sentence'] + "I'm unemployed."
working_df.loc[((working_df['empstat'] == 'unemployed') & (working_df['looking'] == 'yes, looked for work')), 'work_sentence'] = "I'm unemployed, but looking for a job. "

# employed
condition_ing = ((working_df['occ_lemma'].str[-3:] == 'ing') & (working_df['occ_lemma'].str.split().str.len() == 1))
condition_er = ((working_df['occ_lemma'].str[-2:] == 'er') & (working_df['occ_lemma'].str.split().str.len() == 1))

working_df.loc[condition_ing, 'work_sentence'] = working_df.loc[condition_ing, 'work_sentence'] + "I work in " + working_df.loc[condition_ing, 'occ_lemma']
working_df.loc[condition_er, 'work_sentence'] = working_df.loc[condition_er, 'work_sentence'] + "I'm a " + working_df.loc[condition_er, 'occ_lemma']

# working somewhere else
working_df.loc[((working_df['pwstate2'] > "n/a") & (working_df['pwstate2'] <= "mexico") & (working_df['pwstate2'] != 'california')), "work_sentence"] = working_df.loc[((working_df['pwstate2'] > "n/a") & (working_df['pwstate2'] <= "mexico")), "work_sentence"] + " in " + working_df.loc[((working_df['pwstate2'] > "n/a") & (working_df['pwstate2'] <= "mexico") & (working_df['pwstate2'] != 'california')), "pwstate2"].astype(str)

###### Origin
The origin sentence is a little more straight-forward.

```python
sentence = "I was born in {bpl}."
if ancestr1d != "not classified" | "other" | "not reported":
    sentence += "I am {ancestr1d}"
    if ancestr2d != "not classified" | "other" | "not reported":
        sentence += " and {ancestr2d}."
    else:
        sentence += "."
else:
    pass
```

In [160]:
# sentence starter
working_df['origin_sentence'] = "I was born in "

# birthplace
working_df["origin_sentence"] = working_df['origin_sentence'] + working_df['bpl'].astype(str).str.split("(").str[0].str.title() + ". "

# a little clean up
working_df["origin_sentence"] = working_df['origin_sentence'].str.replace("europe, ns", "europe")
working_df["origin_sentence"] = working_df['origin_sentence'].str.replace("cambodia (kampuchea)", "cambodia")
working_df["origin_sentence"] = working_df['origin_sentence'].str.replace("other ussr/russia", "USSR")
working_df["origin_sentence"] = working_df['origin_sentence'].str.replace("asia, nec/ns", "asia")
working_df["origin_sentence"] = working_df['origin_sentence'].str.replace("yemen arab republic (north)", "yemen")
working_df["origin_sentence"] = working_df['origin_sentence'].str.replace("other n.e.c", "")
working_df["origin_sentence"] = working_df['origin_sentence'].str.replace("united kingdom, ns", "UK")
working_df["origin_sentence"] = working_df['origin_sentence'].str.replace("americas, n.s", "")

# ancestry
working_df.loc[working_df['ancestr1d'] < "united states", 'origin_sentence'] = working_df.loc[working_df['ancestr1d'] < "united states", 'origin_sentence'] + "I'm " + working_df.loc[working_df['ancestr1d'] < "united states", 'ancestr1d'].astype(str).str.split("(").str[0].str.title()
working_df.loc[working_df['ancestr2d'] < "united states", 'origin_sentence'] = working_df.loc[working_df['ancestr2d'] < "united states", 'origin_sentence'] + "and " + working_df.loc[working_df['ancestr2d'] < "united states", 'ancestr2d'].astype(str).str.split("(").str[0].str.title()


###### Language, health coverage and veteran status
These are simple straight-forward sentences:
```python
if language != "other or not reported":
    sentence = "I speak {language} at home."
else:
    sentence = ""
    
if hcovany == "with health insurance coverage":
    sentence = "I have health insurance."
else:
    sentence = "I don't have health insurance."
    
if vetstat == "veteran":
    sentence = "I am a veteran."
else:
    sentence = ""
```

In [125]:
usa_flag_emoji = '\U0001F1FA\U0001F1F8'

In [127]:
# starter sentence
working_df['lhv_sentence'] = ""

# language
working_df.loc[(working_df['language'] > 'n/a or blank') & (working_df['language'] < 'other or not reported'), 'lhv_sentence'] = working_df.loc[(working_df['language'] > 'n/a or blank') & (working_df['language'] < 'other or not reported'), 'lhv_sentence'] + "I speak " + working_df.loc[(working_df['language'] > 'n/a or blank') & (working_df['language'] < 'other or not reported'), 'language'].astype(str) + " at home. "

# health insurance
working_df.loc[working_df['hcovany'] == 'with health insurance coverage', 'lhv_sentence'] = working_df.loc[working_df['hcovany'] == 'with health insurance coverage', 'lhv_sentence'] + "I have health insurance. "

# veteran
working_df.loc[working_df['vetstat'] == 'veteran', 'lhv_sentence'] = working_df.loc[working_df['vetstat'] == 'veteran', 'lhv_sentence'] + "I am a veteran " + usa_flag_emoji + " . "



###### Education
```python
if age < 18:
    sentence = "I am in {gradeattd}."
```

In [131]:
# starter sentence
working_df['education_sentence'] = ""

working_df.loc[working_df['age'] < '18', 'education_sentence'] = working_df.loc[working_df['age'] < '18', 'education_sentence'] + "I am in " + working_df.loc[working_df['age'] < '18', 'gradeattd'].astype(str) + ". "

###### Money
The money sentence is a pretty straight-forward as well
```python
if poverty <= 100:
    sentence = "I live under the poverty line."
else:
    if incwages > 36000:
        sentence = "I make {incwage} in wages."
    else:
        sentence = "I make {inctot} a year."
```

In [179]:
# starter sentence
working_df['money_sentence'] = ""

working_df.loc[working_df['poverty'] < 100, 'money_sentence'] = working_df.loc[working_df['poverty'] < 100, 'money_sentence'] + "I live under the poverty line. "

working_df.loc[(working_df['incwage'] > 36000) & (working_df['incwage'] < 999998), 'money_sentence'] = working_df.loc[(working_df['incwage'] > 36000) & (working_df['incwage'] < 999998), 'money_sentence'] + "I make $" + working_df.loc[working_df['incwage'] > 36000, 'incwage'].astype(str) + " in wages. "

working_df.loc[(working_df['incwage'] <= 36000) & (working_df['incwage'] < 999998) & (working_df['inctot'] < 999998) & (working_df['inctot'] > 1), 'money_sentence'] = working_df.loc[(working_df['incwage'] <= 36000) & (working_df['incwage'] < 999998) & (working_df['inctot'] < 999998) & (working_df['inctot'] > 1), 'money_sentence'] + "I make $" + working_df.loc[(working_df['incwage'] <= 36000) & (working_df['incwage'] < 999998) & (working_df['inctot'] < 999998) & (working_df['inctot'] > 1), 'inctot'].astype(str) + " a year. "


###### Moving
For the moving sentence, we'll keep it simple. 
```python
if migrate1 == "moved between states":
    sentence = "I moved from {migplac1} last year."
elif migrate1 == "abroad one year ago":
    if len(migplac1.split()) == 1: # if migplac1 is a one word country to keep tweets short
        sentence = "I moved from {migplac1} last year."
    else:
        pass
else:
    sentence = ""
```

In [142]:
# starter sentence
working_df['moving_sentence'] = ""

working_df.loc[working_df['migrate1'] == 'moved between states', 'moving_sentence'] = working_df.loc[working_df['migrate1'] == 'moved between states', 'moving_sentence'] + "I moved from " + working_df.loc[working_df['migrate1'] == 'moved between states', 'migplac1'].astype(str).str.title() + " last year."

working_df.loc[(working_df['migrate1'] == 'abroad one year ago') & (working_df['migplac1'].str.split().str.len() == 1), 'moving_sentence'] = working_df.loc[(working_df['migrate1'] == 'abroad one year ago') & (working_df['migplac1'].str.split().str.len() == 1), 'moving_sentence'] + "I moved from " + working_df.loc[(working_df['migrate1'] == 'abroad one year ago') & (working_df['migplac1'].str.split().str.len() == 1), 'migplac1'].astype(str).str.title() + " last year."


###### Commute
Commute is a little more complex because `carpool` depends on `tranwork` and `riders` depends on `carpool`. For `trantime` we could just say how many minutes each person commutes, if they commute, or find something interesting in the `departs` and `arrives` variable. 

```python
if tranwork == "auto, truck, or van":
    if carpool == 'carpools':
        if riders > '2 people':
            sentence = "I carpool with {riders}"
        else:
            sentence = ""
    else:
        sentence = "I drive alone"
elif tranwork == "bus or trolley bus":
    sentence = "I take the bus"
elif tranwork == "subway or elevated":
    sentence = "I take the subway"
elif tranwork == "ferryboat":
    sentence = "I take a ferry"
elif tranwork == "bicycle":
    sentence = "I bike"
elif tranwork == "railroad":
    sentence = "I take the train"
else:
    sentence = ""
    
if sentence != "":
    if trantime > 0:
        sentence += " for {trantime} minutes to work"
    else:
        sentence += "."
        
if trantime > 0:
    sentence = "I usually get there by {arrives}."
      
```

In [43]:
# fixing arrives and departs
working_df.loc[working_df['trantime'] > 0, 'arrives'] = working_df[working_df['trantime'] > 0]['arrives'].astype(str).str.zfill(4).str[:2] + ":" + working_df[working_df['trantime'] > 0]['arrives'].astype(str).str.zfill(4).str[2:]

working_df.loc[working_df['trantime'] > 0,'departs'] = working_df[working_df['trantime'] > 0]['departs'].astype(str).str.zfill(4).str[:2] + ":" + working_df[working_df['trantime'] > 0]['departs'].astype(str).str.zfill(4).str[2:]

In [155]:
# starter sentence
working_df['commute_sentence'] = ""

# carpool
working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] == 'carpools'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] == 'carpools'), 'commute_sentence'] + "I carpool with a few people"
working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] != 'carpools'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'auto, truck, or van') & (working_df['carpool'] != 'carpools'), 'commute_sentence'] + "I drive "

# other ways
working_df.loc[(working_df['tranwork'] == 'bus or trolley bus'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'bus or trolley bus'), 'commute_sentence'] + "I take the bus "
working_df.loc[(working_df['tranwork'] == 'subway or elevated'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'subway or elevated'), 'commute_sentence'] + "I take the subway "
working_df.loc[(working_df['tranwork'] == 'ferryboat'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'ferryboat'), 'commute_sentence'] + "I take a ferry "
working_df.loc[(working_df['tranwork'] == 'bicycle'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'bicycle'), 'commute_sentence'] + "I bike "
working_df.loc[(working_df['tranwork'] == 'railroad'), 'commute_sentence'] = working_df.loc[(working_df['tranwork'] == 'railroad'), 'commute_sentence'] + "I take the train "

# how long
working_df.loc[(working_df['commute_sentence'] != "") & (working_df['trantime'] > 0), "commute_sentence"] = working_df.loc[(working_df['commute_sentence'] != "") & (working_df['trantime'] > 0), "commute_sentence"] + " for " + working_df.loc[(working_df['commute_sentence'] != "") & (working_df['trantime'] > 0), "trantime"].astype(str) + " minutes to work. I usually get there by " + working_df.loc[(working_df['commute_sentence'] != "") & (working_df['trantime'] > 0), "arrives"].astype(str)

***
Now we have 9 sentences to put together (language, health coverage, and veteran are grouped together).

In [156]:
sentence_columns = [col for col in working_df.columns if '_sentence' in col]
sentence_columns

['demographics_sentence',
 'household_sentence',
 'work_sentence',
 'origin_sentence',
 'lhv_sentence',
 'education_sentence',
 'money_sentence',
 'moving_sentence',
 'commute_sentence']

We will construct our final sentence by simply concatenating these. Some people will have some sentences and others won't so the `final_sentence` length will vary. We just need to make sure it stays under 280 characters to be able to tweet it out.<br>
We'll save these in a different dataset to keep the sizes manageable (in github at least, our `working` dataset is already almost 25mb).

In [180]:
sentence_df = working_df[sentence_columns].copy()

sentence_df.shape
sentence_df.head()

(376035, 9)

Unnamed: 0,demographics_sentence,household_sentence,work_sentence,origin_sentence,lhv_sentence,education_sentence,money_sentence,moving_sentence,commute_sentence
0,"I am 68, from Tulare ðŸ‘©",I have a car available ðŸš—.,,I was born in China. I'm Taiwanese,I speak chinese at home. I have health insuran...,,I make $20000 a year.,,
1,"I am 75, from Tulare ðŸ‘¨",I have a car available ðŸš—.,,I was born in Philippines. I'm Filipino,"I speak filipino, tagalog at home. I have heal...",,I make $34000 a year.,,
2,"I am 50, from Riverside ðŸ‘©",I have a car available ðŸš—. More than 1 generati...,,I was born in California. I'm African-American,I speak english at home. I have health insuran...,,I make $14300 a year.,,I drive for 22 minutes to work. I usually get...
3,"I am 49, from Riverside ðŸ‘¨",I have a car available ðŸš—. More than 1 generati...,,I was born in California. I'm White/Caucasian,I speak english at home. I have health insuran...,,I make $15700 a year.,,I drive for 141 minutes to work. I usually ge...
4,"I am 22, from Riverside ðŸ‘¨",I have a car available ðŸš—. More than 1 generati...,"I'm unemployed, but looking for a job.",I was born in California. I'm African-American,I speak english at home. I have health insuran...,,I make $11500 a year.,,


In [181]:
sentence_df['final_sentence'] = sentence_df['demographics_sentence'] + sentence_df['household_sentence'] + sentence_df['work_sentence'] + sentence_df['origin_sentence'] + sentence_df['lhv_sentence'] + sentence_df['education_sentence'] + sentence_df['money_sentence'] + sentence_df['moving_sentence'] + sentence_df['commute_sentence']

In [190]:
sentence_df.to_csv("../data/processed/working-101918-V2_sentence_dataset.csv", encoding='utf-8', index = False,)

In [193]:
# Saving the lines as a text for twitterbot
from random import shuffle

with open("../data/processed/working-V2-sentences.txt", 'w', encoding = 'utf-8',) as file:
    sentences = list(sentence_df['final_sentence'])
    shuffle(sentences)
    text = "\n".join(sentences)
    file.write(text)

84603553