<a href="https://colab.research.google.com/github/melinaomueller/DLATK/blob/main/DLATK_Colab_Tutorial_for_Differential_Language_Analysis_(Getting_Started).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> DLATK - Colab Tutorial for Differential Language Analysis and Prediction </h1>

<br/>

✋ **NOTE** - You need to create a copy of this notebook before you work through it. To do so, click the "Save a copy in Drive" option in the File menu.

<br/>

This tutorial covers:

<img src="https://drive.google.com/uc?export=view&id=16SgSLj9KkmxUX8LEeFNsq0GRE4gqvQux" height="170"/>

# Setup Colab


**This required code block sets up your Colab environment for dlatk by:**

1. Installing `dlatk` and its dependencies.
2. Telling dlatk it is in colab mode by running `--colabify`.
3. Adjusting colab's output format.

In [None]:
!pip install dlatk[wordcloud,langid]
!dlatkInterface.py --colabify

#We also ask colab to shorten the output rows so it's easier to scroll
from dlatk.tools.colab_methods import colab_shorten_and_bg
colab_shorten_and_bg() #keeps the output block short and changes the output background color



TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-04-28 21:44:39
-----
-------
Settings:

Database - None
Corpus - None
Group ID - user_id
-------
Interface Runtime: 1.10 seconds
DLATK exits with success! A good day indeed  ¯\_(ツ)_/¯.


👆 You should see `DLATK exits with success! A good day indeed  ¯\_(ツ)_/¯.` if the install and colabify steps completed.

✋ **NOTE** - You may also see a line reading "ERROR: pip dependency ..." but that is ok as long as you saw `DLATK exits with success!`.

<br />



<a name="data"></a>
# Setup data

**Note: This setup data section is optional.** Read below if you wish to understand how to format data.

<br />

DLATK uses data in CSV (comma-separated value) format. Your CSV needs at least two columns:
* **`message`**: contains the text to be analyzed.
* **`message_id`**: contains a unique id for each message.

<br />

The CSV may also have other columns, including other ids you may wish to analyze your data by (e.g. `user_id`). For example,

|message_id|message|user_id|created_date|
|----------|-------|-------|------------|
|17557|New rules on pizza to be introduced|1405024|2004-05-27|
|15996|I just read John Kerry's nomination acceptance speech. Christ, he can talk.  God bless America.|3523319|2004-07-29|
|27462| Talk about 'better late than never'...   urlLink  Couple Living Together 77 Years Marries.   Thanks to  Zorak  for the link.|942828|2003-04-04|
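
Before uploading your own file, it can help to confirm that it has this layout. The snippet below is a minimal sketch using only the Python standard library; the helper name `check_message_csv` and the sample rows are illustrative assumptions and are not part of DLATK.

```python
import csv
import io

# A tiny in-memory CSV in the layout described above (rows are illustrative).
raw_csv = """message_id,message,user_id
1,"New rules on pizza to be introduced",1405024
2,"Talk about 'better late than never'",942828
"""

REQUIRED_COLUMNS = {"message_id", "message"}

def check_message_csv(text):
    """Return the parsed rows, raising ValueError if the layout is not usable."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    rows = list(reader)
    ids = [row["message_id"] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("message_id values must be unique")
    return rows

rows = check_message_csv(raw_csv)
print(len(rows), "messages from", len({r["user_id"] for r in rows}), "users")
```

To check a real file, you could read its text with `open(path, newline="").read()` and pass it to the same function.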

<br />
<br />

Now that you are familiar with the basic data format, you have two options:

👉 **1. If you wish to use the [default tutorial dataset](https://github.com/dlatk/dlatk/blob/public/dlatk/data/colab_dataset.md), then you can skip directly to the [Run Differential Language Analysis](#dla) section.** The default dataset contains the files mentioned below -
* `msgs404u.csv` - language data from 404 blog authors.
* `users404.csv` - age, gender, and occupation for the 404 authors.

<br />

👉 **2. Read below if you wish to use your own data.**

<br/>

## Setup data: Upload a custom CSV to Google Drive (Optional)



* **If your data is not in Drive**, then you need to upload it by running the  `upload_dataset()` command as shown below. When you run it, click "Allow" for colab to access your drive. <br />
<em>Again, you can also skip this optional step and jump ahead to "[Run Differential Language Analysis (DLA)](#dla)". You will use the default tutorial data, which is recommended for first time users.</em>

<br/>

* **If your data is already in Drive**, pass the file name to the command as an argument and your dataset will be downloaded to Colab. For example, if your file was in a folder named `DATA` and it was called `interesting_text.csv` then you would run `upload_dataset("interesting_text.csv", "DATA")`


In [None]:
from dlatk.tools.colab_methods import upload_dataset
upload_dataset()

👆 You should see `File FILENAME.csv copied successfully from Google Drive`.

✋ **NOTE** - If you get `MessageError: Error: credential propagation was unsuccessful`, then you did not log in correctly.

<hr />
<br/>

**You are now ready to do some language analyses!**

If your text data needs some cleaning, consider our [data cleaning tutorial](https://dlatk.github.io/dlatk/tutorials/tut_data_cleaning.html). It goes over, e.g., filtering to a language, removing common social media markup, and removing duplicates. For now it is not in colab, but all of the commands also work in colab.

<br />

<a name="dla"></a>
# Run Differential Language Analysis (DLA)

<img src="https://drive.google.com/uc?export=view&id=1TArhnhbdHThRL8_sebYBgkWkxZdQ9b_P" height="350" width="700"/>

<br/>



**dlatkInterface.py.** <br />
<hr />

The most popular way people use dlatk is through its command interface: `dlatkInterface.py`.

For all commands, DLATK uses the following **mandatory settings** -

>  **`-d`** - **d**atabase that will contain all the data dlatk works with. This is often just your project name. <br />
 **`-t`** - message **t**able where our text lives. If it's not loaded yet (as in this case), we can provide the name of the CSV file to be loaded into the database. <br />
 **`-g`** - **g**roup_id, the table column we will aggregate (unit of analysis).

 For this tutorial, all of your commands will begin with: `!dlatkInterface.py  -d colab_csv -t msgs404u.csv -g user_id` <br />
> `colab_csv` is the **d**atabase, `msgs404u.csv` is the message_**t**able, and `user_id` is the **g**roup_id.


<hr />
<br />


## DLA: Step 1 - Extract word and phrase (ngrams) features

The first step of DLA is to extract features from the language. Here, we do this by extracting words and 2-word phrases known as ngrams or "tokens" -- this process of splitting sequences of letters into words is called "tokenization".
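
To make the idea concrete, here is a minimal sketch of tokenization and ngram extraction in plain Python. It assumes a very simple regular-expression tokenizer; DLATK's own tokenizer handles far more cases (emoticons, URLs, punctuation), so treat this only as an illustration of the concept.

```python
import re

def tokenize(text):
    """Lowercase and keep runs of letters, digits, or apostrophes (a simplification)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("New rules on pizza to be introduced")
one_grams = ngrams(tokens, 1)
two_grams = ngrams(tokens, 2)
print(one_grams)
print(two_grams)
```

A message with `k` tokens yields `k` 1grams and `k - 1` 2grams, which is why the 2gram table tends to contain many more distinct features.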

<br/>

To do this, we use the following **ngram extraction flags** in addition to the mandatory settings.
> **`--add_ngrams`** - the flag which starts the ngram extraction process<br />
 **`-n 1 [2] [3]`** -  the value or values for n in ngrams<br />
 **`--combine_feat_tables [NAME]`** - there will be 1 feature table for each `n`, this command concatenates them into one table with the provided `NAME`.
 <br />

Putting it all together, this command will extract 1grams and 2grams from the `msgs404u.csv` file, and store them into a single "feature table" named `1to2gram`.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id \
  --add_ngrams -n 1 2 \
  --combine_feat_tables 1to2gram




TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-04-28 21:47:00
-----
Connecting to SQLite database: /content/sqlite_data/colab_csv.db
query: PRAGMA table_info(msgs404u)
SQL Query: DROP TABLE IF EXISTS feat$1gram$msgs404u$user_id
SQL Query: CREATE TABLE feat$1gram$msgs404u$user_id ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(36), value INTEGER, group_norm DOUBLE)


Creating index correl_field on table:feat$1gram$msgs404u$user_id, column:group_id 


SQL Query: CREATE INDEX correl_field$1gram$msgs404u$user_id ON feat$1gram$msgs404u$user_id (group_id)


Creating index feature on table:feat$1gram$msgs404u$user_id, column:feat 


SQL Query: CREATE INDEX feature$1gram$msgs404u$user_id ON feat$1gram$msgs404u$user_id (feat)
query: PRAGMA table_info(msgs404u)
SQL Query: DROP TABLE IF EXISTS feat$meta_1gram$msgs404u$user_id
SQL Query: CREATE TABLE feat$meta_1gram$msgs404u$user_id ( id INTEGER PRIMARY KEY, group_id INTE

👆 If you see `DLATK exits with success!` then the command ran successfully and the features are ready!
<hr />
<br />
<br />


To check the created feature tables, you can use the **`--show_feat_tables`** flag as shown below.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id \
  --show_feat_tables

👆 You should see 5 feature tables:
```
feat$1gram$msgs404u$user_id
feat$meta_1gram$msgs404u$user_id
feat$2gram$msgs404u$user_id
feat$meta_2gram$msgs404u$user_id
feat$1to2gram$msgs404u$user_id
```

1.   `feat$1gram$msgs404u$user_id` contains the 1grams (single words).
2.   `feat$meta_1gram$msgs404u$user_id` contains count data about 1grams.

Similarly `feat$2gram$msgs404u$user_id` and `feat$meta_2gram$msgs404u$user_id` contain 2grams and their count data respectively, while `feat$1to2gram$msgs404u$user_id` is the concatenated table from `feat$1gram$msgs404u$user_id` and `feat$2gram$msgs404u$user_id`.

<hr />

<br />

<hr />

So, what is a **Feature Table**? - A feature table is a data type to store extracted representations of the language. All feature tables contain at least the below columns -

* `group_id` - individual units of analysis (group) of data
* `feat` - extracted features, which can be tokens or lexicon features (to be discussed), embeddings, etc.
* `value` - frequency value of the feature in the group.
* `group_norm` - the normalized frequency of the feature (i.e. the relative frequency of the word) within the group.
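
As a rough illustration of how `value` and `group_norm` relate, the sketch below builds feature-table-like rows from hypothetical tokens. It assumes `group_norm` is the feature count divided by the group's total token count, which matches the "relative frequency" description above; the group ids and tokens are made up.

```python
from collections import Counter

# Hypothetical tokens for two groups (users); not taken from the tutorial data.
tokens_by_group = {
    101: ["pizza", "is", "good", "pizza"],
    102: ["school", "is", "out"],
}

feature_rows = []
for group_id, tokens in tokens_by_group.items():
    counts = Counter(tokens)
    total = sum(counts.values())
    for feat, value in sorted(counts.items()):
        feature_rows.append({
            "group_id": group_id,
            "feat": feat,
            "value": value,                # raw count in the group
            "group_norm": value / total,   # relative frequency in the group
        })

for row in feature_rows:
    print(row)
```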

<hr />
<br />

You can validate the structure and contents of a feature table by looking at its first 10 rows using the **`--view_tables`** flag:

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id \
  -f 'feat$1gram$msgs404u$user_id' \
  --view_tables

👆 Scroll and you should see the first 10 rows of the message table and the first 10 of the feature table.
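
If you prefer to inspect a table from Python, the sketch below shows the same "first 10 rows" idea with the standard `sqlite3` module. It uses an in-memory database with illustrative rows so that it runs anywhere; the column layout is copied from the `CREATE TABLE` statement in the log output above. Pointing `sqlite3.connect` at the tutorial's database file instead is an assumption you would need to verify in your own session.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the tutorial's SQLite file
table = "feat$1gram$msgs404u$user_id"

# Column layout copied from the CREATE TABLE statement in the log output.
conn.execute(
    f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY, group_id INTEGER, '
    "feat VARCHAR(36), value INTEGER, group_norm DOUBLE)"
)
# Illustrative rows only.
conn.executemany(
    f'INSERT INTO "{table}" (group_id, feat, value, group_norm) VALUES (?, ?, ?, ?)',
    [(101, "pizza", 2, 0.5), (101, "is", 1, 0.25), (101, "good", 1, 0.25)],
)

first_rows = conn.execute(f'SELECT * FROM "{table}" LIMIT 10').fetchall()
for row in first_rows:
    print(row)
conn.close()
```

The table name is wrapped in double quotes because it contains `$` characters.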

<br />


## DLA: Step 2 - Feature Filtering

The number of distinct tokens can be large in comparison to the sample size. However, a lot of these tokens are particular to just 1 or 2 examples (see [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law) ), and not representative of the sample, and hence can be eliminated. Flags required to **filter such outlying words/phrases** are:

> **`--feat_occ_filter`** - create a new feature table with infrequent features removed.<br />
 **`--set_p_occ`** - features used by fewer than this proportion of groups are dropped (e.g. `0.10` means 10% of groups). <br />
 **`--feat_colloc_filter`** - create a new feature table without rare sequences of words (limiting to [collocations](https://nlp.stanford.edu/fsnlp/promo/colloc.pdf)).

<br/>

Similarly, when running filters, it is also good to specify a minimum word count per group (user in this case) required for the group to be considered:

> **`--group_freq_thresh`** -  ignore groups which do not contain a certain number of words when running analyses (e.g. when calculating p_occ)
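
The sketch below illustrates the logic of an occurrence filter combined with a group word-count threshold. It is a simplified reading of the flag descriptions above, not DLATK's implementation: the function name, the made-up counts, and the choice of `>=` at both thresholds are assumptions.

```python
# Hypothetical per-group token counts: {group_id: {feature: count}}.
counts_by_group = {
    1: {"the": 60, "pizza": 50},
    2: {"the": 90, "school": 30},
    3: {"the": 5, "rareword": 1},      # only 6 tokens in total
    4: {"the": 70, "pizza": 40},
}

def occurrence_filter(counts_by_group, p_occ, group_freq_thresh):
    """Keep features used by at least p_occ of the groups that meet the word-count threshold."""
    eligible = {
        g: feats for g, feats in counts_by_group.items()
        if sum(feats.values()) >= group_freq_thresh
    }
    groups_using = {}
    for feats in eligible.values():
        for feat in feats:
            groups_using[feat] = groups_using.get(feat, 0) + 1
    return {f for f, n in groups_using.items() if n / len(eligible) >= p_occ}

kept = occurrence_filter(counts_by_group, p_occ=0.5, group_freq_thresh=100)
print(sorted(kept))
```

Here group 3 is ignored because it has too few tokens, and `school` is dropped because only one of the three remaining groups uses it.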

👇 For example, the command below keeps features (from `feat$1to2gram$msgs404u$user_id`) used by at least `10%` of the groups, and considers only groups with at least `100` tokens.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id --group_freq_thresh 100 \
  -f 'feat$1to2gram$msgs404u$user_id' \
  --feat_occ_filter --set_p_occ 0.10 --feat_colloc_filter

👆 If you see `DLATK exits with success!`, then you have successfully filtered the features. The filtered feature table is named `feat$1to2gram$msgs404u$user_id$0_1$pmi3_0`.
<hr />
<br />

To validate the structure and content of the created table, you can use `--view_tables` like above.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id \
  -f 'feat$1to2gram$msgs404u$user_id$0_1$pmi3_0' \
  --view_tables

👆 You should see the first 10 rows of the `feat$1to2gram$msgs404u$user_id$0_1$pmi3_0` feature table.

<br />
<hr />
<br />

## DLA: Step 3 - Correlate (or associate) features with outcomes



Now that you have the language features, you can correlate them against outcomes like `age` while controlling for another variable like `gender`.


<img src="https://drive.google.com/uc?export=view&id=11ODkDFwsfQFc-5R5o757k3Ij9XbW9_Yb" height="250" width="500"/>

<br/>

To do this, you need to tell DLATK the feature table to use and the outcomes:

* **`-f 'TABLE_NAME'`** - name of the feature table (`feat$1to2gram$msgs404u$user_id$0_1$pmi3_0` in this case)
* **`--outcome_table NAME`** - the name of the table with outcomes (`users404.csv`, can be the message table if it contains the outcomes)
* **`--outcomes OC1 [OC2...]`** - list of outcomes to be associated with (`age` in this case)
* **`--controls C1 [C2...]`** - list of statistical controls for the association (we will control for `gender` in this example).

<br/>

Then, specify that you want a correlation matrix as output with the **`--rmatrix`** flag. Because the `gender` variable is categorical, we also add **`--cat_to_bin gender`**, which converts the variable into a [one-hot representation](https://wandb.ai/ayush-thakur/dl-question-bank/reports/How-One-Hot-Encoding-Improves-Machine-Learning-Performance--VmlldzoxOTkzMDk).
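
To see what "controlling for" a variable means numerically, here is a minimal sketch for one feature, one outcome, and one binary control. It uses the textbook closed form for the standardized coefficient in a two-predictor regression; the data and helper names are hypothetical, and DLATK's own computation handles many features and controls at once.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def standardized_beta(feature, outcome, control):
    """Standardized coefficient of `feature` when `outcome` is regressed on feature + control."""
    r_fo = pearson(feature, outcome)
    r_co = pearson(control, outcome)
    r_fc = pearson(feature, control)
    return (r_fo - r_co * r_fc) / (1 - r_fc ** 2)

# Hypothetical data for six users.
gender = ["f", "m", "f", "m", "f", "m"]
gender_bin = [1 if g == "f" else 0 for g in gender]      # categorical -> binary indicator
age = [23, 25, 31, 35, 42, 47]
feat_norm = [0.010, 0.012, 0.020, 0.019, 0.031, 0.030]   # relative frequency of one feature

plain_r = pearson(feat_norm, age)
beta = standardized_beta(feat_norm, age, gender_bin)
print("plain r:", round(plain_r, 3))
print("beta controlling for gender:", round(beta, 3))
```

When the control is uncorrelated with the feature, the controlled coefficient equals the plain correlation; the two diverge as the control explains more of the shared variance.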


👇 For example, the command below associates the ngram features (`feat$1to2gram$msgs404u$user_id$0_1$pmi3_0`) with the `age` of the user, controlling for `gender`, and stores the results in an HTML file. By default, it uses standardized multiple regression.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id --group_freq_thresh 100 \
  --feat_table 'feat$1to2gram$msgs404u$user_id$0_1$pmi3_0' \
  --outcome_table users404.csv --outcomes age \
  --controls gender --cat_to_bin gender \
  --rmatrix

👆 If you see `DLATK exits with success!` then the command executed successfully. The command should have stored an HTML file (`feat.1to2gram.msgs404u.user_id.0_1.pmi3_0.age.gender__0.freq100.rMatrix.html`) with the results in the `Files` tab.

<br/>

<img src="https://drive.google.com/uc?export=view&id=1PaP-OJH04D4ta-iV-nvpdel-K7oZyq8O" height="300" width="600"/>

<hr />
<br />

In addition to the tabular format, you can also view the correlated features as word clouds by adding the **`--make_wordclouds`** flag (as shown below).



In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id --group_freq_thresh 100 \
  --feat_table 'feat$1to2gram$msgs404u$user_id$0_1$pmi3_0' \
  --outcome_table users404.csv --outcomes age \
  --controls gender --cat_to_bin gender \
  --rmatrix \
  --make_wordclouds

👆 The above command produces the HTML (as above) and wordclouds stored in the `feat.1to2gram.msgs404u.user_id.0_1.pmi3_0.age.gender__0.freq100._tagcloud_wordclouds` folder under the `Files` tab.

<br/>

<img src="https://drive.google.com/uc?export=view&id=1_xiIFVjA6aYHcrnR8KJwKUdZaHdzWfQz" height="300" width="600"/>

<hr />
<br />
<br />

You can view the word clouds using the `print_wordclouds()` function below which takes in the path to the folder containing the word clouds. For example, to view the word clouds extracted earlier, the command would be - `print_wordclouds("feat.1to2gram.msgs404u.user_id.0_1.pmi3_0.age.gender__0.freq100._tagcloud_wordclouds")`.

In [None]:
from dlatk.tools.colab_methods import print_wordclouds
print_wordclouds("feat.1to2gram.msgs404u.user_id.0_1.pmi3_0.age.gender__0.freq100._tagcloud_wordclouds")

👆 You should see two word clouds: the negative (words and phrases associated with younger age) and the positive (words and phrases associated with older age).

## DLA: Using Other Features - Lexicon Features (Optional)

You can also generate language features from a lexicon (calculate the proportion of the lexicon present in the text) of your choice.

Sometimes this is as simple as aggregating counts (in case of unweighted lexicon like LIWC) while sometimes there is a weighting factor involved (in case of weighted lexicon like `dd_permaV3` which measures an individual's well-being as per the PERMA scale).
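
The difference between the two lexicon types can be sketched in a few lines. The lexicon entries, category names, and weights below are made up for illustration and are not taken from LIWC or `dd_permaV3`; the scoring rule (sum of relative frequencies, optionally multiplied by a weight) is a simplified assumption about how such features are aggregated.

```python
# Relative frequencies (group_norm) of 1grams for one hypothetical user.
group_norms = {"happy": 0.02, "friends": 0.01, "alone": 0.005, "the": 0.06}

# Unweighted lexicon: each category is just a set of words (made-up entries).
unweighted_lexicon = {"positive_words": {"happy", "friends"}, "negative_words": {"alone"}}

# Weighted lexicon: each word carries a weight per category (made-up weights).
weighted_lexicon = {"relationships": {"friends": 1.5, "happy": 0.4, "alone": -0.8}}

def unweighted_scores(group_norms, lexicon):
    """Sum the relative frequencies of the words in each category."""
    return {cat: sum(group_norms.get(w, 0.0) for w in words)
            for cat, words in lexicon.items()}

def weighted_scores(group_norms, lexicon):
    """Sum weight * relative frequency over the words in each category."""
    return {cat: sum(wt * group_norms.get(w, 0.0) for w, wt in weights.items())
            for cat, weights in lexicon.items()}

unweighted = unweighted_scores(group_norms, unweighted_lexicon)
weighted = weighted_scores(group_norms, weighted_lexicon)
print(unweighted)
print(weighted)
```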

<br/>

You need to provide the below flags in addition to mandatory settings -

* **`--add_lex_table -l LEX_TABLE_NAME`** - commands DLATK to add the lexicon from `LEX_TABLE_NAME`
* **`--weighted_lexicon` [Optional]** - flag to mention that the lexicon type is weighted.

👇 For example, you can extract features from *PERMA* dictionary as shown below.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id --group_freq_thresh 100 \
  --add_lex_table -l dd_permaV3 \
  --weighted_lexicon




TopicExtractor: gensim Mallet wrapper unavailable, using Mallet directly.

-----
DLATK Interface Initiated: 2025-04-28 21:46:41
-----
Connecting to SQLite database: /content/sqlite_data/colab_csv.db
SQL Query: CREATE TABLE msgs404u (message_id INT, user_id INT, created_date VARCHAR(10), message LONGTEXT);
Importing data, reading msgs404u.csv file
Reading remaining 2392 rows into the table...
query: PRAGMA table_info(msgs404u)
SQL Query: DROP TABLE IF EXISTS feat$cat_dd_permaV3_w$msgs404u$user_id$1gra
SQL Query: CREATE TABLE feat$cat_dd_permaV3_w$msgs404u$user_id$1gra ( id INTEGER PRIMARY KEY, group_id INTEGER, feat VARCHAR(10), value INTEGER, group_norm DOUBLE)


Creating index correl_field on table:feat$cat_dd_permaV3_w$msgs404u$user_id$1gra, column:group_id 


SQL Query: CREATE INDEX correl_field$cat_dd_permaV3_w$msgs404u$user_id$1gra ON feat$cat_dd_permaV3_w$msgs404u$user_id$1gra (group_id)


Creating index feature on table:feat$cat_dd_permaV3_w$msgs404u$user_id$1gra, column:feat 

👆 The command executed successfully if you see `DLATK exits with success!`.

<hr />

<br />

You should see the name of the new feature table - `feat$cat_dd_permaV3_w$msgs404u$user_id$1gra`, which you can validate like earlier.


In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id \
  -f 'feat$cat_dd_permaV3_w$msgs404u$user_id$1gra' \
  --view_tables

👆 You should see the contents of `feat$cat_dd_permaV3_w$msgs404u$user_id$1gra`
<hr />

<br />

Once you have represented your data samples in terms of lexicon features, you can also correlate them with outcomes, by changing the feature table in `--feat_table` (in this case `--feat_table 'feat$cat_dd_permaV3_w$msgs404u$user_id$1gra'`). Everything else in the command remains the same.

Now, correlate the new set of features against `age` controlling for `gender`, like you did above.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id --group_freq_thresh 100 \
  --feat_table 'feat$cat_dd_permaV3_w$msgs404u$user_id$1gra' \
  --outcome_table users404.csv --outcomes age \
  --controls gender --cat_to_bin gender \
  --rmatrix

👆 The above command produces an HTML file (`feat.cat_dd_permaV3_w.msgs404u.user_id.1gra.age.gender__0.freq100.rMatrix.html`) with the results, under the `Files` tab.

<br/>

# Running Prediction (Language-based Assessment)

<img src="https://drive.google.com/uc?export=view&id=1AgMttURoLyJZ6WhL9wnVzWBt1gSmE0yE" height="350" width="550"/>

## Prediction: Example 1: N-Fold Cross Validation

There are two types of [prediction models](https://scikit-learn.org/stable/modules/linear_model.html#) we can run:
1. **Regression**, which is the prediction of a continuous variable, and
2. **Classification**, which is the prediction of a categorical variable.

<br/>

<hr/>

[N-Fold cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#) - This method randomly divides the dataset into *N* groups, or "folds", of approximately equal size. The model is fit on N-1 folds (the *train* data) and then evaluated for accuracy on the one remaining fold (the *test* data). This is repeated so that each fold serves as the test data once.

<hr/>
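
The fold bookkeeping can be sketched in plain Python as follows. This only illustrates how groups are split so that each one is tested exactly once; the function name and seed are assumptions, and DLATK performs the actual splitting and model fitting for you.

```python
import random

def n_fold_splits(group_ids, n_folds, seed=42):
    """Yield (train_ids, test_ids) pairs, with each group in exactly one test fold."""
    shuffled = list(group_ids)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i, test_ids in enumerate(folds):
        train_ids = [g for j, fold in enumerate(folds) if j != i for g in fold]
        yield train_ids, test_ids

splits = list(n_fold_splits(range(10), n_folds=5))
for train_ids, test_ids in splits:
    print("train on", len(train_ids), "groups, test on", sorted(test_ids))
```

Because every group is held out once, the predictions collected across the test folds cover the whole sample without any group being predicted by a model that saw it during training.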

<br/>

For running a prediction model against an outcome, you use the same data setup commands as in correlations:
* **`-f FEAT_TABLE_NAME`** - name of the feature table.
* **`--outcome_table NAME`** - the name of the table with outcomes.
* **`--outcomes OC1 [OC2...]`** - list of outcomes to predict.


### N-Fold Regression:


Below are the flags to perform regression using cross-validation:
* **`--nfold_test_regression`** - this activates regression using n-fold cross-validation.
* **`--model MODEL_NAME`** - which regression model to use (some examples would be ordinary least squares, ridge regression, etc.)

👇 For example, the command below predicts `age` from 1gram and 2gram features using a ridge regression model with `5`-fold cross-validation.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id --group_freq_thresh 100 \
  --outcome_table users404.csv --outcomes age \
  --feat_table 'feat$1to2gram$msgs404u$user_id$0_1' \
  --nfold_test_regression --model ridgecv

👆 Above the "Settings", you should see a dictionary containing accuracy metrics from cross validation. Key metrics are `r` and `mae`:
```
[TEST COMPLETE]

{'age': {(): {1: {'N': 404,
                  ...
                  'mae': 5.392188704337425,
                  'num_features': 2308,
                  'r': 0.580652941321456,
                  ...}}}}
```
* `mae` is the mean absolute error aggregated across all examples from when they were in a test fold. In this case, the model on average is off in predicting age by 5.4 years.

* `r` is the Pearson correlation between the predicted age and the self-reported age. The correlation is a convenient accuracy metric for regression tasks because it is bounded: 1 is a perfect prediction and 0 is what is expected by chance.
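
For reference, both metrics are simple to compute by hand. The sketch below uses made-up actual and predicted ages (not output from the command above) and plain Python helpers.

```python
from math import sqrt

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up values for five users.
actual_age = [23, 25, 31, 35, 42]
predicted_age = [26, 24, 29, 38, 37]

mae = mean_absolute_error(actual_age, predicted_age)
r = pearson_r(actual_age, predicted_age)
print("mae:", mae)
print("r:", round(r, 3))
```

Note that `mae` is in the units of the outcome (years here), while `r` is unitless, which makes `r` easier to compare across different outcomes.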

DLATK uses 5 folds by default, but you can change this by adding the
`--folds K` parameter.

<br/>


### N-Fold Classification


Similarly, you can perform *classification* (i.e. [predicting categorical outcomes](https://en.wikipedia.org/wiki/Statistical_classification)) with cross-validation using the flags below:

* **`--nfold_test_classifiers`** - activates classification using cross-validation.
* **`--model MODEL_NAME`** - classification model to use (for example logistic regression, etc.).

<br/>

You can also store the prediction output to a CSV with the below flags:
* **`--csv`** - Saves the results to a csv file instead of printing to the screen, like with `--correlate`.
* **`--pred_csv`** - write the predicted scores for the sample to a separate CSV prefixed with the name in `--output_name`.

<br/>

👇 Try predicting if a user is a student or not (`is_student`) from their 1gram and 2gram features using Logistic Regression (`lr`) in 5 fold cross-validation.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id --group_freq_thresh 100 \
  --outcome_table users404.csv --outcomes is_student  \
  --feat_table 'feat$1to2gram$msgs404u$user_id$0_1' \
  --nfold_test_classifiers --model lr

👆 The above command produces key validation metrics such as `acc` (accuracy), the fraction of users correctly classified as student or not (`72.2%` in this case, compared to a baseline *most frequent class accuracy* (`mfc_acc`) of `56.1%`). Simple percentage accuracy is often not the best metric for classification, so the output also includes the [f1 score](https://en.wikipedia.org/wiki/F-score) and [auc](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc).

```
[is_student]
   NO CONTROLS
     + LANG: acc: 0.722, f1: 0.708, auc: 0.745 (p_vs_controls = 1.0000)
   (mfc_acc: 0.561)
```
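
To make the comparison between `acc` and `mfc_acc` concrete, here is a minimal sketch with made-up labels. The most frequent class baseline is the accuracy you would get by always predicting the more common label.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of examples whose predicted label matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def most_frequent_class_accuracy(actual):
    """Accuracy of always predicting the most common label."""
    return Counter(actual).most_common(1)[0][1] / len(actual)

# Made-up labels for ten users (1 = student, 0 = not a student).
actual = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

acc = accuracy(actual, predicted)
mfc_acc = most_frequent_class_accuracy(actual)
print("acc:", acc, "mfc_acc:", mfc_acc)
```

A model is only informative to the extent that `acc` exceeds `mfc_acc`, which is why the two are reported side by side.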

<br/>

## Prediction - Example 2: Training and Deploying a model.

### Train and Deploy - Step 1: Train and save a model


Another approach is to train a predictive model, store it and use it to predict the outcomes in a different dataset.

<br/>

In addition to the flags that point to the right tables, flags necessary for this command are -

* **`--train_regression`** - trains a regression model.
* **`--model MODEL_NAME`** - which machine learning model to use.
* **`--save_model --picklefile FILENAME`** - saves the model into a pickle file `FILENAME`
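
The train-then-save pattern can be sketched with the standard `pickle` module. The example below fits a one-feature ridge regression by hand and round-trips it through an in-memory buffer; the helper names, data, and the dictionary used to hold the model are illustrative assumptions and do not reflect the contents of DLATK's pickle files.

```python
import io
import pickle

def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ intercept + coef * x with an L2 penalty `alpha` on coef."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    # A larger alpha shrinks the slope toward 0.
    return {"alpha": alpha, "x_mean": x_mean, "y_mean": y_mean, "coef": sxy / (sxx + alpha)}

def predict(model, xs):
    """Predict outcomes for new feature values with a fitted model."""
    return [model["y_mean"] + model["coef"] * (x - model["x_mean"]) for x in xs]

xs = [1.0, 2.0, 3.0, 4.0]            # made-up feature values
ys = [20.0, 24.0, 31.0, 33.0]        # made-up ages
model = fit_ridge_1d(xs, ys, alpha=1.0)

buffer = io.BytesIO()                # stands in for a .pickle file on disk
pickle.dump(model, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

print(predict(restored, [2.5]))
```

The restored object makes the same predictions as the original, which is what allows a model trained on one dataset to be applied to another later.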

<br/>

👇 In the example below, you will build a model (using ridge regression with a penalty of 1000) that predicts `age` from 1gram and 2gram features (`feat$1to2gram$msgs404u$user_id$0_1`), and save it to a file named `age.1to2grams.ridge1000.gft100.pickle`.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs404u.csv -g user_id --group_freq_thresh 100 \
  --outcome_table users404.csv --outcomes age  \
  -f 'feat$1to2gram$msgs404u$user_id$0_1' \
  --train_regression --model ridge1000 \
  --save_model --picklefile age.1to2grams.ridge1000.gft100.pickle

### Train and Deploy - Step 2: Predict from new data



Once you have trained a model to predict an outcome, you can use it to predict from unseen data (or *test* data).

👉 If you wish to use a default *test* dataset, skip the next command.

👇 To load your dataset to test the model, you can use the `upload_dataset` function from the [Setup data](#data) section above.

In [None]:
#ONLY RUN THIS IF YOU WISH TO ADD NEW DATA
from dlatk.tools.colab_methods import upload_dataset
upload_dataset()

👇 You need to extract the same set of features as the ones used to train the predictive model. So, extract 1gram and 2gram features used by at least 10% of the groups, and consider only the groups with at least 100 tokens.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs100u.csv -g user_id --group_freq_thresh 100 \
  --add_ngrams -n 1 2 \
  --combine_feat_tables 1to2gram \
  --feat_occ_filter --set_p_occ 0.10

👆 You should see the name of the new feature table - `feat$1to2gram$msgs100u$user_id$0_1` (you can use `--view_tables` to check the table).
<hr /><br />
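
The reason the new data needs the same kind of features is that a trained model expects one value per feature it saw during training. The sketch below illustrates that alignment with hypothetical feature names and values; how DLATK matches features internally is not shown in this tutorial, so this is only a conceptual picture.

```python
# Feature names a (hypothetical) saved model was trained on, in a fixed order.
training_features = ["the", "pizza", "school", "of the"]

# Relative frequencies extracted from one new user's text (made-up values).
new_user_features = {"the": 0.05, "school": 0.01, "homework": 0.02}

def align_to_training(features, training_features):
    """Build a vector in training order; unseen features are dropped, missing ones become 0."""
    return [features.get(name, 0.0) for name in training_features]

vector = align_to_training(new_user_features, training_features)
print(vector)
```

Features that appear only in the new data (`homework` here) cannot contribute to the prediction, because the model has no coefficient for them.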



Once you have all the features required by the prediction model, you can predict the outcome using the flags:

* **`--predict_regression_to_outcome_table TABLE_NAME`** - predicts the outcomes into a table named `TABLE_NAME`.
* **`--load --picklefile FILENAME`** - load the model from the file `FILENAME`.

In [None]:
!dlatkInterface.py \
  -d colab_csv -t msgs100u.csv -g user_id --group_freq_thresh 100 \
  -f 'feat$1to2gram$msgs100u$user_id$0_1' \
  --predict_regression_to_outcome_table lbp_age \
  --load --picklefile age.1to2grams.ridge1000.gft100.pickle

👆 You should see the name of the new table with the predictions - `p_ridg$lbp_age`. You should also see the top-3 messages and predicted age for 2 users picked at random. For example -

```
...
Example predictions
-------
Group ID: 4088894
Top 3 messages for the group:
I have been a member of the Dragonswood site...

Well, this weekend is going to me quadrupally special for me...

Well, I've never run a blog before...

Prediction: 27.350689449787602
-------
...
```

## Authors and References

* [Shashanka Subrahmanya](https://github.com/shshnk94) - Lead Colab Developer
* [H. Andrew Schwartz](https://www3.cs.stonybrook.edu/~has/) - DLATK Creator / Colab Mentor
* [Johannes Eichstaedt](https://jeichstaedt.com/) - Mentor
* [Adithya Ganesan](https://sjgiorgi.github.io/) - DLATK Developer
* [Salvatore Giorgi](https://sjgiorgi.github.io/) - DLATK Maintainer

### References:
**Colab:** <br />
Subrahmanya, S., Schwartz, H. A., Eichstaedt, J. C., Ganesan, A., & Giorgi, S. (2024). DLATK - Colab Tutorial for Differential Language Analysis. [*github.com/dlatk/dlatk/blob/public/colab.md.*](https://github.com/dlatk/dlatk/blob/public/colab.md)

**DLATK:** <br />
Schwartz, H. A., Giorgi, S., Sap, M., Crutchley, P., Ungar, L., & Eichstaedt, J. (2017). Dlatk: Differential language analysis toolkit. In *Proceedings of the 2017 conference on empirical methods in natural language processing: System demonstrations* (pp. 55-60).
