# 2. Create and derive new variables

Tally provides a set of functions for creating new variables and deriving their values from variables that already exist in the dataset, using Tally logic.

These functions are:
 - <a href="API/DataSet.html#tally_core.DataSet.derive">`DataSet.derive`</a> - to derive a new variable using logic
 - <a href="API/DataSet.html#tally_core.DataSet.code_count">`DataSet.code_count`</a> - to count the occurrences of certain answers
 - <a href="API/DataSet.html#tally_core.DataSet.band">`DataSet.band`</a> - to create a single-choice variable from a number, such as age groups


For more information about how to construct logical operators, refer to [Logic structures and functions](tally_logic).

In [6]:
#
# To run this notebook, you first have to install Tally. To install Tally, you need a token that gives you access.
#
from google.colab import files
import json
import io
import os
# Check if the file 'tally_keys.json' exists
if not os.path.exists('tally_keys.json'):
  uploaded = files.upload()
  # Assuming only one file is uploaded, get its filename and content
  filename = list(uploaded.keys())[0]
  file_content = uploaded[filename]
  # Load JSON directly from the uploaded content
  keys = json.loads(file_content.decode('utf-8'))
else:
  # If the file already exists, just load its content
  with open('tally_keys.json', 'r') as f:
      keys = json.load(f)

try:
  # Try to import the package
  import tally_core
except ImportError:
  # If the import fails, the package is not installed. Install it.
  !pip install git+https://{keys['tally_api']}@github.com/datasmoothie/tally-core.git@master


In [1]:
import tally_core as tc
import pandas as pd
import json
import requests
dataset = tc.DataSet('Museum')


meta = requests.get("https://github.com/datasmoothie/tally-documentation-notebooks/raw/main/data/Example_Museum.json").json()
data = pd.read_parquet('https://github.com/datasmoothie/tally-documentation-notebooks/raw/main/data/Example_Museum.parquet')
dataset.from_components(meta_dict=meta, data_df=data)

dataset2 = tc.DataSet("Sports stores")
meta = requests.get("https://github.com/datasmoothie/tally-documentation-notebooks/raw/main/data/Example%20Data%20(A).json").json()
data = pd.read_parquet('https://github.com/datasmoothie/tally-documentation-notebooks/raw/main/data/Example%20Data%20(A).parquet')
dataset2.from_components(meta_dict=meta, data_df=data)


## Derive new variables
The `derive` method uses Tally logic to create new variables, assigning codes according to the logic we supply. It can be used to create "net" or "top 2/bottom 2" variables, or any other variable we want our researchers to have access to.

We start by creating a variable that combines the `gender` variable and the `resident` variable. First, let's look at the meta-data for these variables:

In [2]:
dataset.meta('gender')
dataset.meta('resident')

| gender: Gender of respondent (single) | codes | texts | missing |
|---|---|---|---|
| 1 | 23 | Male | |
| 2 | 24 | Female | |

| resident: Do you live in this country? (single) | codes | texts | missing |
|---|---|---|---|
| 1 | 9 | Yes | |
| 2 | 10 | No | |
| 3 | 11 | Not answered | |


In [3]:
from tally_core.core.tools.view.logic import *

logic = [
  (1, "Male residents", intersection([{"gender":[23]}, {"resident":9}])),
  (2, "Female residents", intersection([{"gender":[24]}, {"resident":9}])),
  (3, "Male non-residents", intersection([{"gender":[23]}, {"resident":[10,11]}])),
  (4, "Female non-residents", intersection([{"gender":[24]}, {"resident":[10,11]}]))
]
dataset.derive('gender_resident', 'single', "Gender/resident", logic)

We can do a sanity check by aggregating the result.

In [5]:
dataset.crosstab(['gender', 'resident', 'gender_resident'])

| Question | Values | Total |
|---|---|---|
| gender. Gender of respondent | Base | 602.0 |
| gender. Gender of respondent | Male | 339.0 |
| gender. Gender of respondent | Female | 263.0 |
| resident. Do you live in this country? | Base | 602.0 |
| resident. Do you live in this country? | Yes | 428.0 |
| resident. Do you live in this country? | No | 174.0 |
| resident. Do you live in this country? | Not answered | 0.0 |
| gender_resident. Gender/resident | Base | 602.0 |
| gender_resident. Gender/resident | Male residents | 244.0 |
| gender_resident. Gender/resident | Female residents | 184.0 |
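
To make the logic above concrete, here is a minimal plain-Python sketch of what deriving with intersections amounts to: a respondent receives a code when every condition attached to that code holds. It does not call Tally. The helper `derive_code`, the first-match rule and the sample rows are assumptions made for illustration; only the codes come from the meta-data shown earlier.

```python
# Hypothetical sample rows using the codes from the meta-data above.
respondents = [
    {"gender": 23, "resident": 9},   # male, resident
    {"gender": 24, "resident": 9},   # female, resident
    {"gender": 23, "resident": 10},  # male, non-resident
    {"gender": 24, "resident": 11},  # female, not answered
]

# (new code, label, conditions); all conditions must hold (an intersection).
rules = [
    (1, "Male residents", {"gender": [23], "resident": [9]}),
    (2, "Female residents", {"gender": [24], "resident": [9]}),
    (3, "Male non-residents", {"gender": [23], "resident": [10, 11]}),
    (4, "Female non-residents", {"gender": [24], "resident": [10, 11]}),
]


def derive_code(row, rules):
    """Return the first code whose conditions all match, or None."""
    for code, _label, conditions in rules:
        if all(row[var] in allowed for var, allowed in conditions.items()):
            return code
    return None


derived = [derive_code(row, rules) for row in respondents]
print(derived)
```

Because the four rules are mutually exclusive, the order in which they are checked does not matter here; with overlapping rules a single-choice variable would need an explicit tie-break.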


### Derive with interlock

The above is such a common request that there is a dedicated method for it. The <a href='API/DataSet.html#tally_core.DataSet.interlock'>`DataSet.interlock`</a> method takes a list of variables and creates one category for every combination of their answers.

In [9]:
dataset.interlock('gender_resident_v2', "Gender/resident", ['gender', 'resident'])
dataset.crosstab('gender_resident_v2')

| Question | Values | Total |
|---|---|---|
| gender_resident_v2. Gender/resident | Base | 602.0 |
| gender_resident_v2. Gender/resident | Male/Yes | 244.0 |
| gender_resident_v2. Gender/resident | Male/No | 95.0 |
| gender_resident_v2. Gender/resident | Male/Not answered | 0.0 |
| gender_resident_v2. Gender/resident | Female/Yes | 184.0 |
| gender_resident_v2. Gender/resident | Female/No | 79.0 |
| gender_resident_v2. Gender/resident | Female/Not answered | 0.0 |
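
The categories in this output correspond to the Cartesian product of the answer labels of the two variables. The sketch below reproduces that labelling in plain Python with `itertools.product`; it is not how Tally implements `interlock`, and numbering the new codes from 1 is an assumption.

```python
from itertools import product

# Answer labels copied from the meta-data for `gender` and `resident`.
gender = ["Male", "Female"]
resident = ["Yes", "No", "Not answered"]

# One category per combination; numbering from 1 is an assumption.
interlocked = [
    (code, "/".join(labels))
    for code, labels in enumerate(product(gender, resident), start=1)
]

for code, label in interlocked:
    print(code, label)
```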


## Code count

If we want to count how often certain codes appear in questions, we use <a href="API/DataSet.html#tally_core.DataSet.code_count">`DataSet.code_count`</a>, which supports single, multi-choice and array questions.

We start by looking at a variable where guests were asked to rate a particular part of our museum, an array variable called `rating.Column` (remember, `array` is the terminology Tally uses for grids and loops).

In [10]:
dataset.meta('rating.Column')

| rating.Column: Q30 (delimited set) | items | item texts | codes | texts | missing |
|---|---|---|---|---|---|
| 1 | rating[{other}].Column | Other | 48.0 | Not at all interested (1) | |
| 2 | rating[{dinosaurs}].Column | Dinosaurs | 49.0 | Not particularly interested (2) | |
| 3 | rating[{conservation}].Column | Conservation | 50.0 | No opinion (3) | |
| 4 | rating[{fish_and_reptiles}].Column | Fish and reptiles | 51.0 | Slightly interested (4) | |
| 5 | rating[{fossils}].Column | Fossils | 52.0 | Very interested (5) | |
| 6 | rating[{birds}].Column | Birds | | | |
| 7 | rating[{insects}].Column | Insects | | | |
| 8 | rating[{whales}].Column | Whales | | | |
| 9 | rating[{mammals}].Column | Mammals | | | |
| 10 | rating[{minerals}].Column | Minerals | | | |


We want to count how many departments our guests find interesting, so we count codes 51 and 52 (slightly and very interested).

In [11]:
dataset.code_count('rating.Column', count_only=[51, 52])

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ... 
597    5.0
598    7.0
599    5.0
600    5.0
601    2.0
Length: 602, dtype: float64

This `pandas.Series` object can now be used to create a new variable in the dataset.
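
Conceptually, the count is a per-respondent tally of grid items whose answer is one of the selected codes. The following plain-Python sketch shows that idea on made-up data; the helper `count_codes` and the sample answers are hypothetical and do not reflect Tally's internals.

```python
# Hypothetical answers: one dict per respondent, one entry per grid item.
answers = [
    {"dinosaurs": 52, "birds": 51, "whales": 48},
    {"dinosaurs": 49, "birds": 50, "whales": 48},
    {"dinosaurs": 51, "birds": None, "whales": 52},  # None = item not answered
]

count_only = {51, 52}  # "Slightly interested" and "Very interested"


def count_codes(row, codes):
    """Count how many items in the row hold one of the given codes."""
    return sum(1 for value in row.values() if value in codes)


counts = [count_codes(row, count_only) for row in answers]
print(counts)
```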

### Counting a subset of a grid
If we have large grids, or grids with more than one level, we can use the <a href="API/DataSet.html#tally_core.DataSet.categories">`DataSet.categories`</a> function to pass only selected items from the grid to the code count.

In [12]:
categories = dataset.categories('rating.Column', find="insects|whales|mammals")
categories

['rating[{insects}].Column',
 'rating[{whales}].Column',
 'rating[{mammals}].Column']

Note that the `categories` method supports regex, so `insects|whales|mammals` returns all categories whose names contain any of those words.
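
The effect of such a pattern can be reproduced with Python's standard `re` module. This sketch filters the item names from the meta-data above; using `re.search` (a match anywhere in the name) is an assumption about how the `find` argument behaves, not a statement about Tally's implementation.

```python
import re

# Item names copied from the meta-data for `rating.Column`.
item_names = [
    "rating[{other}].Column",
    "rating[{dinosaurs}].Column",
    "rating[{conservation}].Column",
    "rating[{fish_and_reptiles}].Column",
    "rating[{fossils}].Column",
    "rating[{birds}].Column",
    "rating[{insects}].Column",
    "rating[{whales}].Column",
    "rating[{mammals}].Column",
    "rating[{minerals}].Column",
]

# `|` means "or": keep any name containing one of the three words.
pattern = re.compile("insects|whales|mammals")
matches = [name for name in item_names if pattern.search(name)]
print(matches)
```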

In [13]:
dataset.code_count('rating.Column', count_only=[51, 52], items=categories)

0      0
1      0
2      0
3      0
4      0
      ..
597    1
598    3
599    0
600    2
601    0
Length: 602, dtype: int64

## Banding numbers into groups

If we want to create a categorical variable from a numeric variable, by supplying ranges that band answers into groups, we use <a href="API/DataSet.html#tally_core.DataSet.band">`DataSet.band`</a>.

:::{note} 
To demonstrate the `band` function, we are using a different dataset, stored in a variable called `dataset2`. It has `age` stored as a number, whereas the Museum demo dataset has already categorised it.

:::

First, we confirm that `age` is stored as a numeric variable.

In [14]:
dataset2.meta('age')

| | int |
|---|---|
| age: Age | |


Then, we decide on our bands. A band can be defined as a single number, a tuple holding a numeric range, or a dict mapping a label to a numeric range. For example, all of these are valid:

 - `0`
 - `(26, 35)`
 - `{"Twenty six to thirty five":(26, 35)}`
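
One way to think about the three forms is that each reduces to a label plus a lower and an upper bound. The sketch below illustrates this in plain Python; the helpers `normalise_band` and `band_value` are hypothetical, and both the inclusive bounds and the default `low-high` labels are assumptions rather than documented Tally behaviour.

```python
def normalise_band(band):
    """Turn a number, a (low, high) tuple or a {label: (low, high)} dict
    into a (label, low, high) triple."""
    if isinstance(band, dict):
        label, (low, high) = next(iter(band.items()))
        return label, low, high
    if isinstance(band, tuple):
        low, high = band
        return f"{low}-{high}", low, high
    return str(band), band, band


def band_value(value, bands):
    """Return the label of the first band containing value, or None.
    Bounds are treated as inclusive (an assumption)."""
    for label, low, high in map(normalise_band, bands):
        if low <= value <= high:
            return label
    return None


bands = [0, (1, 17), (18, 25), (26, 35), {"Older than 65": (66, 120)}]
for age in (0, 21, 35, 70):
    print(age, band_value(age, bands))
```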

In [15]:
dataset2.band(
  name='age', 
  bands=[0, (1,17), (18,25), (26, 35), (36, 45), (46, 55), (56, 65), {"Older than 65":(66, 120)}], 
  new_name='age_groups', 
  label="Age groups"
)

Finally, we use the `crosstab` method for a sanity check.

In [16]:
dataset2.crosstab('age_groups')

| Question | Values | Total |
|---|---|---|
| age_groups. Age groups | Base | 8255.0 |
| age_groups. Age groups | 0 | 0.0 |
| age_groups. Age groups | 1-17 | 0.0 |
| age_groups. Age groups | 18-25 | 1896.0 |
| age_groups. Age groups | 26-35 | 2670.0 |
| age_groups. Age groups | 36-45 | 2654.0 |
| age_groups. Age groups | 46-55 | 1035.0 |
| age_groups. Age groups | 56-65 | 0.0 |
| age_groups. Age groups | Older than 65 | 0.0 |
