\[Note: This project is provided as an example of how your project should flow as a document.  Your project, of course, may differ from this in content.  The main purpose of this example is to illustrate how to integrate your code and text to form something that reads more or less like a lab report or research paper.  When reading through it with that in mind, note the following good things about the project:

* There is an introduction section that explains what the project is and gives an overview of the data (including a link to the source!).
* Decisions about how to write the code are explained, not in excruciating detail, but enough.  There are some comments in the code, but longer discussion is put into markdown cells that preview or review the code.
* If the code required trying a few different things to get it to work, it doesn't include all the wrong tries, but briefly describes the troubleshooting process to explain how the eventual code was arrived at.
    * (If your project involves a more substantial attempt to compare different methods or ideas, of course you should show the different methods or ideas.  For instance, if you wanted to explore how different scikit models performed on the same task, you should show all the models and compare them.  Don't just show the best one and say "By the way I tried a bunch of other ones but they weren't as good."  In other words, there is a difference between "wrong tries" and "things I tried specifically to see if they would work and even if they didn't I can analyze why and/or compare them to other things that worked better".)
* The results are explained in conceptual terms, not just a bland report of numbers.
* There is a conclusion section that sums up what was found and ideas it generated as to what else could be done in the future.

Your project may be structured slightly differently (especially if you chose a quite different task, such as generating text rather than predicting something), but the above guidelines are still relevant.

The topic of this project is fairly simple and doesn't do much beyond what we did in the assignments.  This means it would not receive a perfect score on the "Content" section of the grading, because (as described in the project guidelines on GauchoSpace) part of what counts there is going beyond what we did in class.  The project does do a few small new things (mostly related to "discovering" useful features of pandas and scikit that we didn't discuss in class), but these are more technical in nature and not new ways of applying class concepts.

The project would get full points for Code and Explanation though.  It might get about 25/40 on Content.  In other words, overall this project would be about an A-minus.  It takes an unambitious goal but accomplishes it cleanly.  Part of the point of this example is to say that if you do something very similar to what we did in class, you darn well better do a good job.  If you do something more ambitious, you have more leeway.

Have fun!\]

# Happy Moments
## An exploration

For my project I decided to work with the ["HappyDB" dataset](https://www.kaggle.com/ritresearch/happydb/home) that I found on Kaggle.  This dataset contains about 100,000 brief descriptions of "happy moments", collected via Amazon Mechanical Turk.  Some of the statements have a category label provided and some don't.  The dataset also has demographic information about the people who submitted the happy moments.

I thought this would be an interesting data set to explore because it provides a very "dense" coverage of a particular kind of emotion.  We discussed sentiment analysis in class, but there we were mostly distinguishing between positive and negative sentiments, or a scale from positive to negative (e.g., star ratings).  But in this case all sentiments are positive.  We can't do sentiment analysis per se with this data because there aren't any sentiment judgments (i.e., no scale or rating distinguishing "ecstatic moments" from "mildly pleasant moments").  But we can explore the details of the positive side of the sentiment spectrum.

For the main analysis of my project, I created some scikit models to try to predict the various demographic characteristics based on the text of the happy moment.  Then, I looked at the model coefficients and compared them to see how effective each word was at predicting each characteristic.  Before getting to that, though, I'll give a brief introduction to the dataset.

### Overview of the data

Let's take a quick look at what's in the dataset.  There are two files, `cleaned_hm.csv` and `demographic.csv`:

In [1]:
%config IPCompleter.greedy=True
%config IPCompleter.use_jedi = False

In [2]:
import pandas

moments = pandas.read_csv('../Data/happydb/cleaned_hm.csv')
demo = pandas.read_csv('../Data/happydb/demographic.csv')

The `moments` file contains data on each happy moment reported by a user:

In [3]:
moments.head()

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection
1,27674,2,24h,I was happy when my son got 90% marks in his e...,I was happy when my son got 90% marks in his e...,True,1,,affection
2,27675,1936,24h,I went to the gym this morning and did yoga.,I went to the gym this morning and did yoga.,True,1,,exercise
3,27676,206,24h,We had a serious talk with some friends of our...,We had a serious talk with some friends of our...,True,2,bonding,bonding
4,27677,6227,24h,I went with grandchildren to butterfly display...,I went with grandchildren to butterfly display...,True,1,,affection


There are roughly 100,000 moments here:

In [4]:
len(moments)

100535

By reading through the documentation for this dataset (which also has [its own website](https://rit-public.github.io/HappyDB/)) I was able to find out what the various columns mean:

* `hmid` is just a unique ID for each moment.
* `wid` is the worker ID, which can be used to link to demographic info in the other file (which I'll get to later).
* `reflection_period` is the period of time in which the person was supposed to imagine a happy moment (either "24h" for the past 24 hours, or "3m" for the past 3 months).
* `ground_truth_category` contains the "true" category of the moment.  It's not clear how this was derived but for my purposes I treated it as gospel.  Only about 15% of the moments have a ground truth category given, and they are divided among seven different categories.  The other 85% had no category.
* `predicted_category` is a category predicted by a classifier apparently set up by the creators of the dataset, but I didn't make use of this.
* Finally, `original_hm` contains the actual text of the happy moment description.  There is also `cleaned_hm` where they apparently did some spellchecking but I decided to work with the "raw" text for a more realistic task.

While looking around on the HappyDB website and Github repository, I noticed [a ticket](https://github.com/rit-public/HappyDB/issues/4) mentioning that some "moments" occur multiple times.  It does seem to be true:

In [5]:
moments.original_hm.duplicated().sum()

3978

It's not clear why this is but it seems like it's either due to bad data (people copy-pasting the same fake "happy moment" many times) or some error in the processing that was done when the data was collected.  Either way, I decided to remove duplicates to avoid biasing the results of my analyses.

In [6]:
moments.drop_duplicates(subset='original_hm', inplace=True)
moments.original_hm.duplicated().sum()

0

One thing to note is that many workers completed multiple happy moments.  Most did only a handful (six or fewer) but some did up to 96 different moments:

In [7]:
moments.wid.value_counts().describe()

count    10838.000000
mean         8.909116
std         13.100993
min          1.000000
25%          3.000000
50%          3.000000
75%          6.000000
max         96.000000
Name: wid, dtype: float64

This is relevant when we move on to look at the `demographic` file, which contains various demographic information about each user.  This information is linked to the main table via the `wid` column, which gives the Mechanical Turk ID of the worker.

In [8]:
demo.head()

Unnamed: 0,wid,age,country,gender,marital,parenthood
0,1,37.0,USA,m,married,y
1,2,29.0,IND,m,married,y
2,3,25.0,IND,m,single,n
3,4,32.0,USA,m,married,y
4,5,29.0,USA,m,married,y


There are nearly 11,000 distinct users:

In [9]:
len(demo)

10844

These demographic characteristics are good targets for machine learning analysis.  It looks like the `age` column was just something where people typed in whatever they wanted, so in addition to regular numbers it contains odd formats like `"60yrs"`.  After playing around a bit I was able to convert these to numbers by stripping out all non-numeric characters, and filling in "NA" if there was nothing left (which happened when someone put in something like "prefer not to say" for their age).  Also there apparently were a couple jokers who put in their age as 2 or 233 or something, so I converted those to NA as well.

The resulting age distribution is mostly fairly young people (under 40).

In [10]:
# convert strings to numbers
demo['clean_age'] = demo.age.map(lambda x: x if isinstance(x, float) else ''.join(char for char in x if char.isnumeric() or char == '.')).replace('', float('nan')).astype(float)
# make it so anyone who said their age was over 100 or under 17 has age counted as NA
# (.loc assigns on the DataFrame itself, which avoids pandas' SettingWithCopyWarning)
demo.loc[demo.clean_age > 100, 'clean_age'] = float('nan')
demo.loc[demo.clean_age < 17, 'clean_age'] = float('nan')
demo.clean_age.describe()

count    10798.000000
mean        32.799037
std         10.501696
min         17.000000
25%         25.000000
50%         30.000000
75%         37.000000
max         98.000000
Name: clean_age, dtype: float64
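
As a side note, the same kind of cleanup can be written with pandas string methods. This is only a sketch of an alternative, not the code that produced the numbers above: the sample values and the names `raw_age` and `clean_age_alt` are made up for illustration. It also behaves slightly differently, since it keeps the first number it finds rather than joining every digit in the string.

```python
import pandas as pd

# Hypothetical sample of free-text ages, similar in spirit to the raw `age` column
raw_age = pd.Series([37.0, "60yrs", "prefer not to say", "2", "233", "29"])

# Pull out the first run of digits (optionally with a decimal part); no match gives NaN
extracted = raw_age.astype(str).str.extract(r"(\d+(?:\.\d+)?)")[0]

# Convert to float, then keep only ages from 17 to 100 (everything else becomes NaN)
clean_age_alt = pd.to_numeric(extracted, errors="coerce")
clean_age_alt = clean_age_alt.where(clean_age_alt.between(17, 100))

print(clean_age_alt.tolist())
```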

The other demographic features are:

* `country`, with about 85% of people being from the US.
* `gender`, with a roughly 50/50 split between male and female, with a small number of "o" (presumably for "other") and a few NA values.
* `marital` indicating marital status, with roughly 50% single, 40% married, and the rest distributed among the small categories of "divorced", "separated", "widowed", or NA
* `parenthood`, with about a 60/40 split between non-parents ("n") and parents ("y"), with a few NA values.

That gives an overview of the kind of data I was working with.

### Predicting demographic characteristics

For the first part of my project, I decided to use the text of the happy moments to try to predict some of the demographic characteristics of the people, namely marital status, parenthood, and age, along with the "reflection period".  To do this, I followed the same basic procedure we used in the Sentiment Analysis assignment.

#### Setup

Before doing that, I had to get my data into the right format.

First, I decided to "merge" the two tables so that the `moments` table included the demographic information as well.  To do this, I set the worker ID as the index on the demographics table:

In [11]:
demo.set_index('wid', inplace=True, drop=False)
demo.head()

Unnamed: 0_level_0,wid,age,country,gender,marital,parenthood,clean_age
wid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,1,37.0,USA,m,married,y,37.0
2,2,29.0,IND,m,married,y,29.0
3,3,25.0,IND,m,single,n,25.0
4,4,32.0,USA,m,married,y,32.0
5,5,29.0,USA,m,married,y,29.0


This let me use `.map` to get the demographic data for each column I wanted and add them to the moments table as new columns.  (There might be a better way to do this, but this is what I came up with.)

In [12]:
for demo_column in ['clean_age', 'country', 'gender', 'marital', 'parenthood']:
    moments[demo_column] = moments.wid.map(demo[demo_column])
moments.head()

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,clean_age,country,gender,marital,parenthood
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection,35.0,USA,m,single,n
1,27674,2,24h,I was happy when my son got 90% marks in his e...,I was happy when my son got 90% marks in his e...,True,1,,affection,29.0,IND,m,married,y
2,27675,1936,24h,I went to the gym this morning and did yoga.,I went to the gym this morning and did yoga.,True,1,,exercise,30.0,USA,f,married,y
3,27676,206,24h,We had a serious talk with some friends of our...,We had a serious talk with some friends of our...,True,2,bonding,bonding,28.0,DNK,f,married,n
4,27677,6227,24h,I went with grandchildren to butterfly display...,I went with grandchildren to butterfly display...,True,1,,affection,55.0,USA,f,divorced,y
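
The `.map` loop above does the job. For comparison, here is a minimal sketch of the same idea using `DataFrame.merge`, on two tiny made-up tables (`moments_toy` and `demo_toy` are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Tiny stand-ins for the two tables (values are invented for illustration)
moments_toy = pd.DataFrame({"hmid": [10, 11, 12], "wid": [2, 1, 2]})
demo_toy = pd.DataFrame({
    "wid": [1, 2],
    "marital": ["single", "married"],
    "parenthood": ["n", "y"],
})

# A left join keeps every moment and attaches the matching worker's demographic columns
merged = moments_toy.merge(demo_toy[["wid", "marital", "parenthood"]], on="wid", how="left")
print(merged)
```

One difference to keep in mind is that `merge` returns a new DataFrame with a fresh index, whereas `.map` adds columns to `moments` and leaves its index untouched.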


Since I can't really deal with NA values in the scikit models, I decided to get rid of any rows that contain NaN in any column (except the "ground truth category", since that has a lot of NAs).  This still leaves me with about 96,000 rows.

In [13]:
moments = moments.dropna(subset=[col for col in moments.columns if col != 'ground_truth_category'])
len(moments)

95921

The various categories (like parenthood, country, etc.) need to be numeric for scikit.  I found a pandas method called `factorize` that converts each unique value to an integer ID, and I used it to make new columns containing numeric-ID versions of the various demographic categories.  I also included `reflection_period`, although it's a characteristic of the individual happy moment, not a demographic characteristic of the happy person.

In [14]:
for col in ['country', 'gender', 'marital', 'parenthood', 'reflection_period']:
    new_col_name = col + '_id'
    # the "sort" argument sorts things so that the numeric IDs will correspond to the categories in alphabetical order
    moments[new_col_name] = moments[col].factorize(sort=True)[0]
moments.head()

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,clean_age,country,gender,marital,parenthood,country_id,gender_id,marital_id,parenthood_id,reflection_period_id
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection,35.0,USA,m,single,n,94,1,3,0,0
1,27674,2,24h,I was happy when my son got 90% marks in his e...,I was happy when my son got 90% marks in his e...,True,1,,affection,29.0,IND,m,married,y,39,1,1,1,0
2,27675,1936,24h,I went to the gym this morning and did yoga.,I went to the gym this morning and did yoga.,True,1,,exercise,30.0,USA,f,married,y,94,0,1,1,0
3,27676,206,24h,We had a serious talk with some friends of our...,We had a serious talk with some friends of our...,True,2,bonding,bonding,28.0,DNK,f,married,n,21,0,1,0,0
4,27677,6227,24h,I went with grandchildren to butterfly display...,I went with grandchildren to butterfly display...,True,1,,affection,55.0,USA,f,divorced,y,94,0,0,1,0
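
To make the ID columns easier to read, here is a small sketch of what `factorize` returns, using invented values rather than the real data. It gives back both the integer codes and the list of unique labels, so the IDs can always be mapped back to their names.

```python
import pandas as pd

# Invented marital-status values, just to show what factorize returns
status = pd.Series(["single", "married", "single", "divorced"])

# factorize returns (codes, uniques); with sort=True the uniques are in sorted order
codes, uniques = status.factorize(sort=True)

print(codes.tolist())   # integer ID for each row
print(list(uniques))    # the position of a label in this list is its ID

# Map the IDs back to the original labels
recovered = uniques.take(codes)
```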


#### Creating the models

With that I'm ready to start using scikit.  To begin, I created a train/test split.  I found that it's possible to just call `train_test_split` with one argument, getting train and test versions of just that one thing.  I did this because I wanted to create just one train/test split for the whole dataset, even though I'm going to create classifiers for several different Y variables.  Because the train and test values that are returned contain all columns of the DataFrame, I can reach into them to get whichever values I need.

In [15]:
import sklearn
import sklearn.model_selection

train, test = sklearn.model_selection.train_test_split(moments, random_state=888)
train.head()

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,clean_age,country,gender,marital,parenthood,country_id,gender_id,marital_id,parenthood_id,reflection_period_id
11565,39308,834,24h,My power bill was lower than I thought it woul...,My power bill was lower than I thought it woul...,True,1,,achievement,25.0,USA,f,married,n,94,0,1,0,0
2566,30255,1227,24h,I installed and APFC (Automatic Power Factor C...,I installed and APFC (Automatic Power Factor C...,True,1,,achievement,28.0,IND,f,married,y,39,0,1,1,0
15747,43516,413,24h,My boss approved time off for this Thursday to...,My boss approved time off for this Thursday to...,True,1,,affection,29.0,USA,m,married,y,94,1,1,1,0
53823,81770,942,3m,I received a couple unexpected gift cards.,I received a couple unexpected gift cards.,True,1,,enjoy_the_moment,32.0,USA,m,married,y,94,1,1,1,1
73510,101589,1969,3m,I won a prize for best volunteer award at a vo...,I won a prize for best volunteer award at a vo...,True,1,achievement,achievement,37.0,USA,m,single,n,94,1,3,0,1


Then, I created a TfidfVectorizer and used it to create a set of word features for the moment text.  I decided to include bigrams as well as single words, just for fun.

I played around with trying different parameters to the vectorizer.  None of them seemed to make a huge difference in the performance of my models.  Since my project wasn't focused on trying to get the best possible predictions, I didn't spend a lot of time trying to optimize this.  I did wind up using two tweaks: `max_df` sets the maximum document frequency; setting it to a fraction of 0.8 means it will throw out any features that occur in more than 80% of documents.  `min_df` sets the minimum document frequency; setting it to a whole number of 5 means it will throw out any features that occur in fewer than 5 documents.  Using `max_df` also sort of substitutes for using stop words, since any words that occur in almost all the texts will be thrown out.  Because of this, I didn't include the `stop_words` parameter.

In [16]:
import sklearn.feature_extraction.text
import nltk.corpus
#nltk.corpus.stopwords.words('english')

vect = sklearn.feature_extraction.text.TfidfVectorizer(ngram_range=(1, 2), max_df=0.8, min_df=5)
vect.fit(train.original_hm)

features_train = vect.transform(train.original_hm)
features_test = vect.transform(test.original_hm)

The vectorizer found about 37,000 words and bigrams as features:

In [17]:
features_train

<71940x36925 sparse matrix of type '<class 'numpy.float64'>'
	with 1806485 stored elements in Compressed Sparse Row format>

This same set of features will be used in all the models below.
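
To see what the two document-frequency cutoffs do, here is a toy sketch on four invented sentences. The cutoff values are chosen to suit this tiny corpus (in particular `min_df=2` instead of 5), and `toy_vect` is a hypothetical name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four invented "moments": "happy" appears in all of them, "dog" in two, the rest in one
docs = [
    "happy dog walk",
    "happy dog park",
    "happy cake",
    "happy movie",
]

# max_df=0.8 drops terms found in more than 80% of documents ("happy");
# min_df=2 drops terms found in fewer than 2 documents (everything except "dog")
toy_vect = TfidfVectorizer(max_df=0.8, min_df=2)
toy_vect.fit(docs)

kept_terms = sorted(toy_vect.vocabulary_)
print(kept_terms)
```

Only "dog" survives both filters here, which shows how `max_df` removes a word that appears everywhere in much the same way a stop word list would.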

For the later part of my project, I wanted to compare the influence of each word/bigram on each classification.  To prepare for that, I made a table to store the model coefficients.  To begin with, the table just has the feature numbers and the word/bigram, which I get using `.get_feature_names()`.

In [18]:
coef_table = pandas.DataFrame({"Word": vect.get_feature_names()})
coef_table.head(10)

Unnamed: 0,Word
0,00
1,00 am
2,00 clock
3,00 in
4,00 on
5,000
6,000 dollars
7,000 in
8,000 steps
9,00am


Later, I will add more columns to this table to see the influence of each word on each classification task.
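
A note on versions: as far as I know, scikit-learn 1.0 and later provide `.get_feature_names_out()`, and the older `.get_feature_names()` was removed in later releases. The sketch below, on a made-up two-sentence corpus, uses whichever method is available; `toy_vect` and `feature_names` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_vect = TfidfVectorizer(ngram_range=(1, 2))
toy_vect.fit(["I walked my dog", "my dog slept"])

# Use the newer method when it exists, otherwise fall back to the older one
if hasattr(toy_vect, "get_feature_names_out"):
    feature_names = list(toy_vect.get_feature_names_out())
else:
    feature_names = list(toy_vect.get_feature_names())

print(feature_names)
```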

For all of my classifiers, I decided to use the LinearSVC model from scikit since that seemed to be a good one for classification and didn't take too terribly long to run.

##### Predicting marital status

I started by making a model to predict the marital status of the people.  The procedure is basically the same as we did in class and in the assignment.

Just like with the vectorizer, I didn't spend a ton of time trying out different parameters for LinearSVC.  I saw in the documentation that it said "Prefer `dual=False` when `n_samples > n_features`".  Since my training set has about 72,000 samples (that is, separate happy moments) and only about 37,000 features, I guess that means I should use `dual=False`, so that's what I did.  I also found that setting C to 0.1 improved the accuracy slightly, although I can't say I know why.

In [19]:
import sklearn.svm

# create the model
marital_model = sklearn.svm.LinearSVC(dual=False, C=0.1)

# fit it, using marital_id as the Y variable
marital_model.fit(features_train, train.marital_id)

# create table to hold outputs
marital_out = pandas.DataFrame({"Actual": test.marital_id})

# create "baseline" prediction that predicts most common category all the time
# .value_counts() gives the most frequent one first so that's how I got it
# we count the most common only in the training set to avoid leakage
marital_out['Baseline'] = train.marital_id.value_counts().index[0]

# fill in real prediction
marital_out["Model"] = marital_model.predict(features_test)

# display
marital_out.head()

Unnamed: 0,Actual,Baseline,Model
57760,3,3,3
5683,1,3,3
72819,1,3,3
45876,1,3,1
39511,1,3,1


Here's the confusion matrix for my model.  I made it into a DataFrame so it shows up in a nicer display.  The five marital IDs 0-4 correspond to the marital status in alphabetical order, namely "divorced", "married", "separated", "single", and "widowed".  As we can see, the model only ever predicted "married" and "single" (only columns 1 and 3 have nonzero entries).  That's not super surprising because the other statuses were quite rare (if you look at the rows, rows 0, 2, and 4 have very few values overall compared to rows 1 and 3).

In [20]:
import sklearn.metrics

pandas.DataFrame(sklearn.metrics.confusion_matrix(marital_out.Actual, marital_out.Model))

Unnamed: 0,0,1,2,3,4
0,0,243,0,708,0
1,0,4659,0,5062,0
2,0,48,0,114,0
3,0,1694,0,11347,0
4,0,47,0,59,0


In order to evaluate my model, I wanted to compute some metrics.  Since I'll need to do this repeatedly for each model I do (predicting each demographic category), I made a function to put the metrics into a nice table for me:

In [21]:
import sklearn.metrics
# this assumes the true values are in a column called "Actual"
def classification_metrics(output_table):
    metric_table = pandas.DataFrame(index=output_table.columns, columns=["Accuracy", "Matthews", "F1"])
    for col in output_table.columns:
        metric_table.loc[col, "Accuracy"] = sklearn.metrics.accuracy_score(output_table['Actual'], output_table[col])
        metric_table.loc[col, "Matthews"] = sklearn.metrics.matthews_corrcoef(output_table['Actual'], output_table[col])
        metric_table.loc[col, "F1"] = sklearn.metrics.f1_score(output_table['Actual'], output_table[col], average='micro')
    return metric_table

Now I can use this to get all the metrics for my model:

In [22]:
classification_metrics(marital_out)


Unnamed: 0,Accuracy,Matthews,F1
Actual,1.0,1.0,1.0
Baseline,0.543806,0.0,0.543806
Model,0.667445,0.348674,0.667445


The model does better than the baseline model that just picks the most common marital status every time, but it's not super great.  It gets the right answer about two-thirds of the time, whereas the baseline model would be right about 55% of the time.  As mentioned above, since my project isn't focused on improving this model I didn't spend much time trying to get it working better.
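
One detail of the metrics table is that the F1 column is identical to the Accuracy column. That follows from `average='micro'`: when each item has exactly one label, micro-averaged F1 works out to the same number as accuracy. The sketch below checks this on invented labels and shows macro averaging only as a contrast, not as a recommendation.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented true and predicted labels for a three-class problem
y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 1]

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")   # pools all classes together
f1_macro = f1_score(y_true, y_pred, average="macro")   # averages the per-class F1 scores

print(acc, f1_micro, f1_macro)
```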

The next thing was to add the coefficients to my table.  This was a bit confusing because when I looked at the `.coef_` attribute like we did in class, I found it seemed to have 5 separate arrays instead of just one:

In [23]:
marital_model.coef_

array([[ 3.07382540e-01, -3.28963540e-02, -8.78684776e-03, ...,
        -3.18668996e-02, -9.95718868e-03, -3.91831117e-02],
       [-1.94071946e-01, -1.14926490e-01, -1.80154526e-02, ...,
        -3.06790635e-02, -1.96893717e-01, -9.28932454e-02],
       [ 2.86983349e-02, -3.68542446e-03, -2.83404772e-03, ...,
        -5.09436209e-03,  1.34367694e-07, -2.34113188e-03],
       [-2.06221630e-01,  1.65279148e-01,  2.95585743e-02, ...,
         8.57734408e-02,  1.07744954e-01,  5.91728549e-02],
       [ 1.20633620e-01, -8.50346233e-03, -2.46434107e-08, ...,
        -1.37816169e-02,  9.81809017e-02,  7.43126369e-02]])

In the documentation I saw it said the shape of `.coef_` would be "`[n_classes, n_features]`".  I eventually figured out that this means the five arrays in `.coef_` are giving me the weights of each word for each of the five different marital statuses, in alphabetical order.  So the first array there shows the amount each word affects the likelihood that the person is divorced, the second shows the likelihood that the person is married, etc.  I added these weights to my `coef_table` with appropriate labels:

In [24]:
statuses = ['Divorced', 'Married', 'Separated', 'Single', 'Widowed']
for ix in range(5):
    coef_table[statuses[ix]] = marital_model.coef_[ix, :]
coef_table.head()

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed
0,00,0.307383,-0.194072,0.028698,-0.206222,0.1206336
1,00 am,-0.032896,-0.114926,-0.003685,0.165279,-0.008503462
2,00 clock,-0.008787,-0.018015,-0.002834,0.029559,-2.464341e-08
3,00 in,0.053818,-0.146264,0.138538,-0.105549,0.06221828
4,00 on,0.077006,-0.137239,-0.002198,0.091393,-0.02111938


Just to check that these numbers mean what I think they do, I looked at the coefficient for a couple words that you'd think would obviously indicate a certain marital status:

In [25]:
coef_table[coef_table.Word.isin(('wife', 'husband'))]

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed
15100,husband,-0.341863,3.175869,0.007692,-3.080302,-0.096962
35795,wife,-0.26289,2.761686,-0.06103,-2.597856,-0.080679


Sure enough, those words have high coefficients for Married and large negative coefficients for Single, which makes sense.  People would generally only mention a happy moment involving their husband/wife if they were married!

##### Predicting parenthood

Now that I have the setup out of the way, the process for predicting whether the person is a parent or not is basically exactly the same.

In [26]:
# create the model
parent_model = sklearn.svm.LinearSVC(dual=False, C=0.1)

# fit it, using parenthood_id as the Y variable
parent_model.fit(features_train, train.parenthood_id)

# create table to hold outputs
parent_out = pandas.DataFrame({"Actual": test.parenthood_id})

# create "baseline" prediction that predicts most common category all the time
# .value_counts() gives the most frequent one first so that's how I got it
# we count the most common only in the training set to avoid leakage
parent_out['Baseline'] = train.parenthood_id.value_counts().index[0]

# fill in real prediction
parent_out["Model"] = parent_model.predict(features_test)

# display
parent_out.head()

Unnamed: 0,Actual,Baseline,Model
57760,0,0,0
5683,0,0,0
72819,1,0,0
45876,1,0,1
39511,1,0,1


In [27]:
pandas.DataFrame(sklearn.metrics.confusion_matrix(parent_out.Actual, parent_out.Model))

Unnamed: 0,0,1
0,13281,1308
1,5461,3931


I can use my function from above to compute the classification metrics:

In [28]:
classification_metrics(parent_out)

Unnamed: 0,Accuracy,Matthews,F1
Actual,1.0,1.0,1.0
Baseline,0.608357,0.0,0.608357
Model,0.717735,0.388519,0.717735


Again the model does better than the baseline (about 72% accuracy versus about 61%), so that's nice.

This time the `.coef_` array only has one row.  I guess that's because, when there are only two categories, you only need to know how the feature affected one choice, since the other is just the opposite of that.  So like if "child" increased the probability that the person was a parent, it would have to decrease the probability that they were not a parent.  (Like the invisible man!  Get it?  Not apparent!  Haw haw!)

So I just added these coefficients into a column for "Parent".  (The original values of "parenthood" were "n" and "y", so these values will be showing the amount each feature pushes the prediction towards "y", the "higher" outcome in alphabetical order.)

In [29]:
coef_table['Parent'] = parent_model.coef_[0, :]
coef_table.head()

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent
0,00,0.307383,-0.194072,0.028698,-0.206222,0.1206336,0.445993
1,00 am,-0.032896,-0.114926,-0.003685,0.165279,-0.008503462,0.083226
2,00 clock,-0.008787,-0.018015,-0.002834,0.029559,-2.464341e-08,-0.028339
3,00 in,0.053818,-0.146264,0.138538,-0.105549,0.06221828,0.062386
4,00 on,0.077006,-0.137239,-0.002198,0.091393,-0.02111938,0.034466


Again, to sanity check myself, I looked at the coefficients for words related to having kids:

In [30]:
coef_table[coef_table.Word.isin(('son', 'daughter'))]

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent
7313,daughter,0.314199,2.235388,0.075679,-2.871619,0.244662,4.293275
27953,son,0.227793,2.38807,0.101341,-2.714006,-0.00204,4.181169


As expected, the coefficients are high, meaning that occurrence of the words "son" or "daughter" in the happy moment text increases the likelihood that the happy person has kids.
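
The difference in the shape of `.coef_` between the marital and parenthood models can be reproduced on a toy problem. The sketch below uses invented data and hypothetical names (`multi`, `binary`); `classes_` shows the order of the categories that the rows refer to.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Invented data: 6 samples, 2 features
X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.6, 0.4]])
y_three = np.array(["a", "a", "b", "b", "c", "c"])   # three classes
y_two = np.array(["n", "n", "y", "y", "y", "n"])     # two classes

multi = LinearSVC(dual=False).fit(X, y_three)
binary = LinearSVC(dual=False).fit(X, y_two)

print(multi.classes_, multi.coef_.shape)     # one row of weights per class
print(binary.classes_, binary.coef_.shape)   # a single row; positive weights favor classes_[1]
```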

##### Predicting reflection period

Next, I tried to predict the "reflection period", which is the timespan during which the happy moment occurred (either in the past 24 hours or the past 3 months).  The process is the same as before.

In [31]:
# create the model
reflect_model = sklearn.svm.LinearSVC(dual=False, C=0.1)

# fit it, using reflection_period_id as the Y variable
reflect_model.fit(features_train, train.reflection_period_id)

# create table to hold outputs
reflect_out = pandas.DataFrame({"Actual": test.reflection_period_id})

# create "baseline" prediction that predicts most common category all the time
# .value_counts() gives the most frequent one first so that's how I got it
# we count the most common only in the training set to avoid leakage
reflect_out['Baseline'] = train.reflection_period_id.value_counts().index[0]

# fill in real prediction
reflect_out["Model"] = reflect_model.predict(features_test)

# display
reflect_out.head()

Unnamed: 0,Actual,Baseline,Model
57760,1,1,0
5683,0,1,1
72819,1,1,1
45876,1,1,1
39511,1,1,1


In [32]:
pandas.DataFrame(sklearn.metrics.confusion_matrix(reflect_out.Actual, reflect_out.Model))

Unnamed: 0,0,1
0,8042,3880
1,3965,8094


In [33]:
classification_metrics(reflect_out)

Unnamed: 0,Accuracy,Matthews,F1
Actual,1.0,1.0,1.0
Baseline,0.502856,0.0,0.502856
Model,0.672866,0.345746,0.672866


This model is again better than the baseline model, but not super accurate.

I added the coefficients to my table.  Again there is only one set of coefficients since this is a binary classification between the options "24h" and "3m".  Since "24h" is alphabetically before "3m", higher values mean an increased likelihood of a three-month reflection period, so I called this column "LongReflect" to remind myself that bigger numbers indicate higher likelihood of longer reflection period.

In [34]:
coef_table['LongReflect'] = reflect_model.coef_[0, :]
coef_table.head()

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect
0,00,0.307383,-0.194072,0.028698,-0.206222,0.1206336,0.445993,0.027175
1,00 am,-0.032896,-0.114926,-0.003685,0.165279,-0.008503462,0.083226,0.06654
2,00 clock,-0.008787,-0.018015,-0.002834,0.029559,-2.464341e-08,-0.028339,0.001657
3,00 in,0.053818,-0.146264,0.138538,-0.105549,0.06221828,0.062386,0.061224
4,00 on,0.077006,-0.137239,-0.002198,0.091393,-0.02111938,0.034466,0.085032


It was a bit harder to think of words that I would expect to be indicators of a long or short reflection period.  I thought of using words for periods of time, on the theory that if someone mentions "yesterday" in their moment, it's more likely they were thinking of a happy moment in the past day, whereas if they mention "week" or "month" they were more likely thinking of a moment further in the past.

In [35]:
coef_table[coef_table.Word.isin(('yesterday', 'week', 'month'))]

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect
19381,month,-0.034576,0.235652,-0.067571,-0.195524,0.043111,0.204608,0.669396
35010,week,-0.03871,-0.010086,-0.004441,-0.002227,0.091841,-0.123643,0.604841
36712,yesterday,-0.047802,0.629287,0.035259,-0.584107,-0.049334,0.575843,-2.005901


The results are in line with what I expected.  The words "month" and "week" indicate a longer reflection period, and "yesterday" indicates a shorter period.

##### Predicting age

Finally I decided to predict the age of the happy person.  Since age is (just) a number, not a category, for this prediction I used a LinearSVR instead of LinearSVC.  But other than that the process is the same.

When I tried using my same code from before, I got this error:

```
ValueError: Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
```

I'm not sure what's going on here, but apparently the settings that work for the classification model don't work for the regression model.  It seems like I could change the `loss` or `dual` arguments, since the error message indicates that it's the combination of those that causes the problem.  I looked in the documentation and saw that there's only one other option for `loss`, namely `"squared_epsilon_insensitive"`, so I tried that, and fortunately it worked.
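To see which combinations are accepted, here is a small sketch on made-up data (the arrays `X` and `y` and the helper `try_fit` are assumptions for illustration, not part of the project).  It suggests the other way out would have been to keep the default loss and switch to `dual=True`, though behavior may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.svm import LinearSVR

# Tiny synthetic regression problem, only for illustration.
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = X @ np.array([1.0, -2.0, 0.5])

def try_fit(**kwargs):
    """Return 'ok' if LinearSVR fits with these settings, else the error type name."""
    try:
        LinearSVR(C=0.1, max_iter=10000, **kwargs).fit(X, y)
        return "ok"
    except ValueError as err:
        return type(err).__name__

results = {
    "primal + epsilon_insensitive": try_fit(dual=False, loss="epsilon_insensitive"),
    "primal + squared_epsilon_insensitive": try_fit(dual=False, loss="squared_epsilon_insensitive"),
    "dual + epsilon_insensitive": try_fit(dual=True, loss="epsilon_insensitive"),
}
for name, outcome in results.items():
    print(name, "->", outcome)
```

The two losses penalize errors differently (linear versus squared), so the two working options are not guaranteed to give the same model.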

In [36]:
# create the model
age_model = sklearn.svm.LinearSVR(dual=False, C=0.1, loss='squared_epsilon_insensitive')

# fit it, using clean_age as the Y variable
age_model.fit(features_train, train.clean_age)

# create table to hold outputs
age_out = pandas.DataFrame({"Actual": test.clean_age})

# since we're predicting a number, I made the "baseline" prediction just be the mean age of all people in the training set
age_out['Baseline'] = train.clean_age.mean()

# fill in real prediction
age_out["Model"] = age_model.predict(features_test)

# display
age_out.head()

Unnamed: 0,Actual,Baseline,Model
57760,32.0,32.574437,29.824048
5683,27.0,32.574437,33.257766
72819,34.0,32.574437,30.953947
45876,37.0,32.574437,33.294538
39511,23.0,32.574437,38.337989


To show the results, I took a look at the matplotlib plotting library.  After loading it up, I can plot a scatter plot using some methods on the DataFrame:

In [37]:
%matplotlib notebook

age_out.plot.scatter(x='Actual', y='Model', alpha=0.2)

<matplotlib.axes._subplots.AxesSubplot at 0x107375c0>

At least it looks like the model isn't predicting ages wildly outside the real range (like negative ages), so that's good.  Also there is at least a slight upward slope, which means that overall the model was on the right track.  But it flattens out toward the right, meaning maybe the model is not so great at predicting the age above about 30 or 40 years old.

We can compute some regression metrics for this model.  I don't really need to make a function for this, because this is the only regression I'm doing, but what the heck, I like functions.

In [38]:
def regression_metrics(output_table):
    metric_table = pandas.DataFrame(index=output_table.columns, columns=["MAE", "RMSE"])
    for col in output_table.columns:
        # mean absolute error
        metric_table.loc[col, "MAE"] = sklearn.metrics.mean_absolute_error(output_table['Actual'], output_table[col])
        # root mean squared error
        metric_table.loc[col, "RMSE"] = sklearn.metrics.mean_squared_error(output_table['Actual'], output_table[col])**0.5
    return metric_table

In [39]:
regression_metrics(age_out)

Unnamed: 0,MAE,RMSE
Actual,0.0,0.0
Baseline,7.48946,9.94201
Model,6.84539,9.1454


The mean absolute error is about 6.8, so the model is off by 6.8 years on average when predicting the person's age.  Not too bad.  It's better than the baseline model, which is off by about 7.5 years.
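To make the two metrics concrete, here is a hand-rolled sketch of what they compute, using only the first five test rows from the output table above.  Because it uses five rows rather than the whole test set, its numbers will not match the full results.

```python
import math

# First five test rows copied from the output table above (not the full test set).
actual = [32.0, 27.0, 34.0, 37.0, 23.0]
model = [29.824048, 33.257766, 30.953947, 33.294538, 38.337989]
baseline = [32.574437] * len(actual)   # training-set mean age

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in years."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but large misses count more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

for name, pred in [("Baseline", baseline), ("Model", model)]:
    print(name, round(mae(actual, pred), 3), round(rmse(actual, pred), 3))
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off, which is consistent with the full-table RMSE of about 9.1 sitting well above the MAE of about 6.8.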

Now I'll add the coefficients to my table.  Since this is a regression, higher coefficients just mean words that make the model think the person is older (i.e., has a larger age).

In [40]:
coef_table['Age'] = age_model.coef_
coef_table.head()

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect,Age
0,00,0.307383,-0.194072,0.028698,-0.206222,0.1206336,0.445993,0.027175,12.948442
1,00 am,-0.032896,-0.114926,-0.003685,0.165279,-0.008503462,0.083226,0.06654,-1.736856
2,00 clock,-0.008787,-0.018015,-0.002834,0.029559,-2.464341e-08,-0.028339,0.001657,-0.306756
3,00 in,0.053818,-0.146264,0.138538,-0.105549,0.06221828,0.062386,0.061224,1.210835
4,00 on,0.077006,-0.137239,-0.002198,0.091393,-0.02111938,0.034466,0.085032,1.148341


It's harder to find a sanity check for this one, but I checked for words about parents and children.  The theory would be that an older person is less likely to be talking about their parents (maybe because it's more likely their parents are dead), while a younger person is less likely to be talking about their kids (because the younger they are the less likely that they have kids at all).

In [41]:
coef_table[coef_table.Word.isin(('parents', 'son', 'daughter'))]

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect,Age
7313,daughter,0.314199,2.235388,0.075679,-2.871619,0.244662,4.293275,0.115926,21.00611
23274,parents,-0.227634,-0.333312,-0.037396,0.547463,-0.021327,-0.545503,0.122411,-7.294384
27953,son,0.227793,2.38807,0.101341,-2.714006,-0.00204,4.181169,0.021788,16.666495


The results align with what I thought.  The words "son" and "daughter" have positive coefficients, meaning people who are happy about their sons and daughters are older; "parents" has a negative coefficient, meaning people who are happy about their parents are younger.

### Looking at the coefficients

Now that we have those coefficients, we can sort the table by each column and see which words had the biggest effect on each variable.

We'll start with marital status.  Since the model only ever predicted single or married, I'll just look at those ones.  Here are the "biggest influences" on single-ness:

In [42]:
coef_table.sort_values('Single')

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect,Age
15100,husband,-0.341863,3.175869,0.007692,-3.080302,-0.096962,1.672187,-0.002021,9.971460
20292,my husband,-0.614624,3.551002,-0.144397,-3.068453,-0.087847,1.104823,-0.089435,4.607084
7313,daughter,0.314199,2.235388,0.075679,-2.871619,0.244662,4.293275,0.115926,21.006110
20677,my wife,-0.352022,2.992560,-0.096329,-2.761867,-0.048986,0.949931,0.010310,5.625787
27953,son,0.227793,2.388070,0.101341,-2.714006,-0.002040,4.181169,0.021788,16.666495
35795,wife,-0.262890,2.761686,-0.061030,-2.597856,-0.080679,1.170259,-0.068764,8.497696
16818,kids,0.012197,1.894543,0.007667,-1.923958,-0.013934,2.805580,0.112836,4.664844
20570,my son,0.116133,1.323226,0.090252,-1.593745,0.064127,3.025892,0.026791,11.895952
5939,children,0.091120,1.342756,0.082736,-1.578529,0.115363,2.159792,0.165847,12.272053
22795,our,-0.342499,1.706791,-0.007873,-1.442008,-0.033538,1.069542,-0.078977,5.607116


These are interesting and make sense.  Most of the words with a negative coefficient refer to a spouse or children, which are more typical topics for married people.  The ones with a positive coefficient (at the other end of the sorted table) often refer to unmarried relationships, like "girlfriend" and "boyfriend", or even "crush"!
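Since `sort_values` sorts in ascending order, the most negative words come first.  One way to pull out both ends explicitly is with `nsmallest` and `nlargest`.  This is a minimal sketch on a handful of rows copied from the coefficient tables in this section; the small `demo` frame is just for illustration.

```python
import pandas

# A few (Word, Single) pairs copied from the coefficient tables in this section.
demo = pandas.DataFrame({
    "Word": ["husband", "daughter", "kids", "my girlfriend", "girlfriend", "my boyfriend"],
    "Single": [-3.080302, -2.871619, -1.923958, 1.568184, 1.388360, 1.300937],
})

# Words that push the model away from "single"
print(demo.nsmallest(3, "Single"))

# Words that push the model toward "single"
print(demo.nlargest(3, "Single"))
```

The same two calls should work on the full `coef_table` for any of the coefficient columns.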

As for marriage. . .

In [43]:
coef_table.sort_values('Married')

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect,Age
20013,my boyfriend,0.136661,-1.757143,0.084108,1.300937,-0.016660,-0.714093,-0.054077,-9.318716
4638,boyfriend,0.219535,-1.667855,0.029493,1.123184,0.076290,-0.397366,0.078738,-4.612715
20227,my girlfriend,-0.086080,-1.639607,-0.005678,1.568184,-0.025964,-1.342546,0.059385,-10.679214
12117,girlfriend,-0.068545,-1.350071,-0.048471,1.388360,-0.055157,-0.986142,0.151666,-8.622492
10142,fiance,0.152810,-0.902185,-0.048642,0.774048,-0.039190,-0.255214,0.283811,-4.095186
10150,fiancee,-0.118943,-0.817724,-0.037238,0.836768,0.119449,-0.260051,0.005089,-2.564143
31262,thing that,0.002289,-0.817610,-0.022544,0.832103,-0.000238,-0.444151,0.100902,-2.588172
22362,on my,0.010318,-0.814949,0.084820,0.775276,-0.017605,-0.760295,-0.010054,-2.414472
2261,and we,-0.024277,-0.811931,0.019154,0.867739,-0.020354,-0.759804,0.134899,-0.629654
25986,roommate,0.139418,-0.804065,-0.006119,0.700658,-0.039348,-0.751906,-0.021312,-1.843880


These are almost the mirror image of the ones for single!  I guess that makes sense, since if someone is single they're not married, pretty much by definition.

Now let's look at the coefficients for Parent:

In [44]:
coef_table.sort_values('Parent')

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect,Age
20227,my girlfriend,-0.086080,-1.639607,-0.005678,1.568184,-0.025964,-1.342546,0.059385,-10.679214
12117,girlfriend,-0.068545,-1.350071,-0.048471,1.388360,-0.055157,-0.986142,0.151666,-8.622492
1128,am feel,-0.018239,-0.550452,0.024046,0.525504,-0.005997,-0.904386,-0.213118,0.612748
20527,my roommate,0.102541,-0.761471,0.016491,0.648534,-0.024876,-0.860793,-0.324486,-3.039358
20378,my mom,-0.229004,-0.347139,-0.059723,0.652466,-0.040285,-0.839253,0.320206,-7.832780
21281,niece,-0.026021,-0.611459,0.009396,0.636260,0.022957,-0.827140,0.387248,7.278141
14201,her and,0.002920,-0.484939,-0.057226,0.518675,0.015203,-0.811551,0.260818,-5.637352
33276,university,-0.112789,-0.794852,-0.036031,0.916225,-0.031054,-0.789694,0.386445,-11.029000
35798,wife and,-0.157013,1.028010,-0.065361,-0.843367,-0.069584,-0.776870,0.149307,-3.587195
20037,my cat,0.040523,-0.635885,0.070421,0.570112,-0.034685,-0.764612,-0.276195,1.599920


Again the ones with high coefficients are about kids, similar to what we saw with the Married status.  The ones with strongly negative coefficients are again about non-married relationships, and there seem to be some pets in there too.  That sort of makes sense because people who have pets but not kids sometimes do attach themselves to the pets and find happiness in them, whereas people with kids would find similar happiness in their kids.

Now we'll look at the coefficients for LongReflect, which was the one I was most interested in because I couldn't really figure out what words might indicate a longer or shorter reflection period.

In [45]:
coef_table.sort_values('LongReflect')

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect,Age
32507,today,0.040452,0.244824,-0.053790,-0.231948,-0.052616,0.386374,-2.065688,1.497753
36712,yesterday,-0.047802,0.629287,0.035259,-0.584107,-0.049334,0.575843,-2.005901,4.685754
17069,last night,-0.096542,0.307889,-0.021180,-0.217759,0.013030,0.053216,-1.990629,3.498270
148,24 hours,-0.001901,-0.211463,-0.014981,0.151200,0.088570,0.123890,-1.520412,-1.168945
31947,to day,-0.095604,-0.595487,-0.027128,0.728558,-0.032674,0.737763,-1.325280,0.666183
23451,past month,-0.030599,0.449282,0.027431,-0.459722,0.017440,0.046159,-1.288536,0.141893
23444,past 24,-0.070827,-0.107561,-0.027567,0.232727,-0.054697,-0.385144,-1.228682,-1.712842
32643,tomorrow,-0.103769,0.140816,-0.073644,0.031874,-0.047070,-0.195811,-1.178286,-2.096524
7953,dinner,0.074272,0.442286,-0.053177,-0.485103,0.020234,0.186372,-1.163216,3.795123
146,24,-0.057150,0.085570,-0.030648,-0.038848,0.050486,0.193178,-1.157928,0.166431


As you'd expect, most of these words are related to time.  The words with negative coefficients refer to recent periods of time like "yesterday" and "last night", as well as some words that seem to refer to mildly positive recent pastimes like "the gym" or "dinner".  Words with longer periods like "months" and "years" have high coefficients.

Finally we can look at the coefficients for age:

In [46]:
coef_table.sort_values('Age')

Unnamed: 0,Word,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect,Age
33276,university,-0.112789,-0.794852,-0.036031,0.916225,-0.031054,-0.789694,0.386445,-11.029000
20227,my girlfriend,-0.086080,-1.639607,-0.005678,1.568184,-0.025964,-1.342546,0.059385,-10.679214
20013,my boyfriend,0.136661,-1.757143,0.084108,1.300937,-0.016660,-0.714093,-0.054077,-9.318716
16433,its,-0.057549,0.532934,-0.029945,-0.449155,-0.062546,0.789864,-0.613037,-9.130062
19267,moment,-0.207233,-0.231628,0.040109,0.354902,-0.071733,-0.017191,-0.440469,-9.122100
26706,semester,-0.031985,-0.522817,-0.042893,0.584138,-0.028125,-0.631999,1.337627,-9.049690
24540,professor,-0.094654,-0.545315,-0.011777,0.630830,-0.013010,-0.523777,0.324469,-8.978005
20246,my grandmother,-0.134202,-0.183785,-0.035604,0.347234,-0.050941,-0.363062,0.242774,-8.829104
20215,my friends,-0.103700,-0.172809,0.038222,0.313935,-0.083642,-0.191874,0.028886,-8.637880
12117,girlfriend,-0.068545,-1.350071,-0.048471,1.388360,-0.055157,-0.986142,0.151666,-8.622492


Many of the words with negative coefficients are related to college life ("university", "exam", "professor", etc.).  This makes sense because, in this data set, the youngest people were around college age (17 or so), so many of their happy moments might be connected with their college experiences.  We also see some more references to girlfriends and boyfriends, which are the types of relationships people are more likely to talk about when they're younger.

The words with high coefficients are quite fascinating.  The highest scores come from words referring not just to children but to grandchildren!  I hadn't expected that but it makes total sense, since someone who is happy about something related to grandchildren is probably old enough to have grandchildren.  Words related to children also have positive scores here.  One weird thing is that "00", which isn't even a real word, has a really high score.  I'm not sure why that might be.  Old people paying more attention to what time it is?

One final thing I thought to do is look at the correlations between all these coefficient scores.  Pandas has a method that does this for me:

In [47]:
coef_table.corr()

Unnamed: 0,Divorced,Married,Separated,Single,Widowed,Parent,LongReflect,Age
Divorced,1.0,-0.166045,0.000719,-0.211172,0.025963,0.124943,0.009714,0.255892
Married,-0.166045,1.0,-0.060477,-0.9072,-0.048694,0.627905,-0.027815,0.318963
Separated,0.000719,-0.060477,1.0,-0.088839,0.005598,0.054916,0.011485,0.078472
Single,-0.211172,-0.9072,-0.088839,1.0,-0.093817,-0.684243,0.0234,-0.441579
Widowed,0.025963,-0.048694,0.005598,-0.093817,1.0,0.082203,-0.002543,0.166427
Parent,0.124943,0.627905,0.054916,-0.684243,0.082203,1.0,-0.030726,0.412001
LongReflect,0.009714,-0.027815,0.011485,0.0234,-0.002543,-0.030726,1.0,-0.007717
Age,0.255892,0.318963,0.078472,-0.441579,0.166427,0.412001,-0.007717,1.0


The results show that most pairs of marital statuses are negatively correlated with each other (the exceptions, among Divorced, Separated, and Widowed, are all very close to zero), which makes sense because in this dataset each person was marked with only one marital status, so having one makes it less likely that you'd have any other status.  Married and Single stand out with a very strong negative correlation of about -0.9, meaning they're just about complete opposites.
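For reference, here is a hand-written sketch of the Pearson correlation that `corr()` computes by default, applied to the Married and Single coefficients of six words taken from the tables above.  These are all extreme words, so the result should come out even more strongly negative than the -0.9 for the full table.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# (Married, Single) coefficients for: husband, my husband, daughter,
# my boyfriend, girlfriend, roommate
married = [3.175869, 3.551002, 2.235388, -1.757143, -1.350071, -0.804065]
single = [-3.080302, -3.068453, -2.871619, 1.300937, 1.388360, 0.700658]

print(round(pearson(married, single), 3))
```

One caveat I believe applies to recent pandas versions: because `coef_table` also has the text column Word, the call may need to be `coef_table.corr(numeric_only=True)`.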

The Married status is highly correlated with Parent, and Single is negatively correlated with Parent.  This makes sense because if people have kids it's more likely that they're also married.

Then if we look at Age, it has medium positive correlations with Divorced, Married, and Parent, and a negative correlation with Single.  That also makes sense in that older people are more likely to be married, divorced, or have kids, and are less likely to be single.

LongReflect has very low correlations with all other categories.  I suppose this means that this dimension is not really related to any of the others.  That also makes sense because the others are demographics that are measuring characteristics of the person themselves, whereas the "reflection period" is really only about the instructions they received when doing the task (should they think of a moment in the last 24 hours or in the last 3 months).  So it's not really connected to "who they are as a person" in the way that the others are.

I also did a couple scatter plots of some of these pairs of features to get an overall picture of the distribution.

In [48]:
coef_table.plot.scatter(x="Married", y="Parent", alpha=0.3)

<matplotlib.axes._subplots.AxesSubplot at 0x109b5198>

The above scatter plot shows the Married coefficient and the Parent coefficient.  Most of the coefficients are relatively small (between -1 and 1), but they form a large blob that has a clear angle, meaning that even within this group of fairly weak word associations, Married and Parent are associated with the same words.  There are sparse tails of words with very high or low coefficients, which correspond to the words we looked at above ("son", "wife", etc.).  In the above analysis, I only looked at the top few most strongly associated words for each demographic characteristic.  Looking at this plot is helpful because it confirms that the correlation between these two characteristics is not just apparent for words at the top end, but for the main bulk of words as well.

Here's a similar plot of Age versus Parent:

In [49]:
coef_table.plot.scatter(x="Age", y="Parent", alpha=0.3)

<matplotlib.axes._subplots.AxesSubplot at 0x10229a20>

Again there is a slight upward tilt to the main blob of points, indicating that, on average, the more strongly a word is associated with older age, the more strongly it is also associated with a higher likelihood of parenthood.

If we look at Age versus LongReflect. . .

In [50]:
coef_table.plot.scatter(x="Age", y="LongReflect", alpha=0.3)

<matplotlib.axes._subplots.AxesSubplot at 0x10f05278>

This confirms what we saw in the correlations.  The blob of points is basically circular, with no real relationship between these two variables.

### Conclusion

This was an interesting project.  Since the text here was brief descriptions of things that made people happy, and I looked at what words were associated with different demographic characteristics, I sort of got a sense of what makes different kinds of people happy.  In particular, kids, spouses, and other relationships seem to be frequent topics of happiness.  I can't really tell if these things make people happy, because all I was looking at here was whether these things predict the demographic characteristics, not the other way around.  So I can't tell if having kids makes people happier; all I can tell is that if people talk about being happy about their kids, they probably have kids.

That seems kind of obvious, of course.  But maybe with other data I could take this a step further.  For instance, if I had a dataset where people were just talking about events in their lives (like maybe tweets), it would be interesting to see if mentions of these things were more likely to be positive.  Like maybe if someone posts or tweets about their kids or grandkids, it's more likely to be a positive tweet than a negative one.  That would be an interesting project for sentiment analysis.

I also had some ideas about the "reflection period" dimension.  One thing that would be interesting is to see if the types of happy events that people thought of were different for the long and short reflection periods.  I saw a little of this in that words like "dinner" appeared with negative coefficients, meaning if someone mentioned "dinner" that was more likely to be a recent happy moment.  It seems like, in general, if someone asked you for a happy moment in the past three months, you might try to think of something larger-scale, like a job promotion, a vacation, or whatever, rather than something small like a nice dinner or a party you went to.  This could be a window into a different kind of sentiment analysis, where you could try to separate "small" or "transitory" happy moments (minor things like a nice dinner) from "big" or "important" happy moments (like a vacation, a pay raise, a new baby being born, etc.).  I might be able to use the category information to get at this, because it does include separate categories for things like "achievement", which I'd imagine are more major happy moments, vs. "enjoy the moment" or "leisure", which maybe would be smaller-scale moments of happiness.  I might need more data though, because only a small proportion of the data points had category information.

This kind of project could also be useful in an applied context.  As an example, suppose I worked for a company like Amazon that has lots of product reviews.  If I saw that people gave lots of 5-star reviews and mentioned things like "my son" or "my daughter" in the review, I could perhaps infer that the people buying the product were older, or that the product was a good gift for kids.  Or, on a more basic level, it could just let me know that this particular consumer has kids and likes to buy things for them.  This could help in recommending products, because you'd know not just to recommend things that the person themselves might want, but things that their kids might want.

A lot of aspects of this project might generalize to other kinds of data.  For instance, knowing these words predict things like age and marital status, you might be able to take these models and predict the age and marital status of people even based on other text, like Yelp reviews or Amazon reviews.  That could be useful demographic information when targeting ads or the like.

One thing that maybe makes the coefficients not as meaningful, though, is that the models were only medium good.  These coefficients were the model's judgement of how much a given word affected the decision of whether someone was married, a parent, etc.  But the model's accuracy was far from perfect, so looking at how it makes its judgements may not be super informative.  It's like, if someone takes a math quiz and gets 75% correct, that's maybe not the person you want to ask to explain how he got his answers.  Similarly, asking a mediocre model how it made its decisions may not tell us that much.  But the models were a good deal better than random chance, so their coefficients probably do contain some useful information.  It would be interesting to spend a bit more time trying to improve the models' accuracy, and then see if that makes the coefficients even more enlightening.

Overall this was a fun project and a good way to practice with scikit, as well as exploring the interpretation of the model coefficients in a more subjective way.