# Data Science with Data Structures
Using the [Food Choices Dataset](https://www.kaggle.com/borapajo/food-choices?select=food_coded.csv) from Kaggle, the task is to read the dataset and process the information effectively by using basic data structures and functional techniques.

The data covers food choices, nutrition, preferences, and related topics, based on 125 responses. It is raw and uncleaned, and comes as a single CSV file.

Relevant data:
* GPA
* Gender
  * 1 - female
  * 2 - male
* drink: Which picture do you associate with the word drink?
  * 1 - Orange juice
  * 2 - Soda
* exercise: How often do you exercise in a regular week?
  * 1 - Everyday
  * 2 - Twice or three times per week
  * 3 - Once a week
  * 4 - Sometimes
  * 5 - Never
* fries: Which picture do you associate with the word fries?
  * 1 - McDonald’s fries
  * 2 - Homefries
* income:
  * 1 - less than \$15,000 USD
  * 2 - \$15,001 to \$30,000
  * 3 - \$30,001 to \$50,000
  * 4 - \$50,001 to \$70,000
  * 5 - \$70,001 to \$100,000
  * 6 - Higher than \$100,000
* sports: Do you do any sports activity?
  * 1 - Yes
  * 2 - No
* weight: An open-ended question - What is your weight in pounds?

Note:
* some values are nan, personal, or unknown
* all values are enclosed in single quotes ''
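
Since every value arrives as a string, numeric checks have to be explicit. The sketch below is one possible way to tell usable numbers apart from answers such as `nan`; the helper name `parse_number` and the sample values are assumptions made for illustration, not part of the original notebook. Note that `float('nan')` parses without error, so it needs a separate check.

```python
import math

def parse_number(value):
    """Return value as a float, or None when it is not a usable number."""
    try:
        number = float(value)
    except ValueError:
        # Free-text answers cannot be parsed at all.
        return None
    # float('nan') parses successfully, so reject it explicitly.
    return None if math.isnan(number) else number

# Illustrative values only.
print([parse_number(v) for v in ['3.654', 'nan', 'Personal ', '187']])
# [3.654, None, None, 187.0]
```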

In [1]:
import csv

with open('food_coded.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)

# Columns you'll deal with in this project
required_columns = ['GPA', 'Gender', 'drink', 'exercise', 'fries', 'income', 'sports', 'weight']

## Filter and group the data
* Filter the data to include only the required columns
* Group the data to be tuples of (type_of_data, data) 

In [2]:
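def filter_required_data(data, required_columns):
    # Positions of the required columns, looked up once in the header row (data[0]).
    keep = {j for j, name in enumerate(data[0]) if name in required_columns}
    # Keep only those positions in every row, header included.
    return [[entry for j, entry in enumerate(row) if j in keep]
            for row in data]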

filtered_data = filter_required_data(data, required_columns)
filtered_data[0:4]

[['GPA', 'Gender', 'drink', 'exercise', 'fries', 'income', 'sports', 'weight'],
 ['2.4', '2', '1', '1', '2', '5', '1', '187'],
 ['3.654', '1', '2', '1', '1', '4', '1', '155'],
 ['3.3', '1', '1', '2', '1', '6', '2', "I'm not answering this. "]]

In [3]:
def group_values_with_columns(data, n=50):
    '''
    Groups the first n entries of the data. n is 50 by default.
    '''
    # data[0] is the header row, so there are len(data) - 1 entries.
    n_entries = len(data) - 1
    # Clamp n to the number of entries, in case the caller
    # passed a number that's too large.
    if n > n_entries:
        print("Warning: n [", n, "] exceeds the number of data entries.",
              "It is truncated to n=", n_entries, ".")
        n = n_entries
    header = data[0]
    grouped_data = []
    for student in data[1:n + 1]:
        grouped_data.append(list(zip(header, student)))

    return grouped_data

# Group every entry so that the statistics below cover the whole dataset.
grouped_data = group_values_with_columns(filtered_data, n=len(filtered_data) - 1)
grouped_data[0]

[('GPA', '2.4'),
 ('Gender', '2'),
 ('drink', '1'),
 ('exercise', '1'),
 ('fries', '2'),
 ('income', '5'),
 ('sports', '1'),
 ('weight', '187')]
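
Because each grouped record is a list of `(column, value)` pairs, it can also be turned into a `dict` when a value needs to be looked up by name rather than by position. This is a small sketch of that idea, not something the notebook itself does; the variable names `student` and `record` are chosen here for illustration.

```python
# One grouped record, copied from the output above.
student = [('GPA', '2.4'), ('Gender', '2'), ('drink', '1'), ('exercise', '1'),
           ('fries', '2'), ('income', '5'), ('sports', '1'), ('weight', '187')]

# dict() accepts any iterable of (key, value) pairs.
record = dict(student)

print(record['income'])   # 5
print(student[5][1])      # 5, the positional equivalent
```

The positional form depends on the column order staying fixed, while the `dict` form does not.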

## Helper functions to count data based on a condition


In [4]:
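from functools import wraps

def count(f):
    # Decorator: replace the list returned by f with its length.
    @wraps(f)
    def wrapper(*args, **kwargs):
        matches = f(*args, **kwargs)
        return len(matches)
    return wrapper

# Helper function to filter data based on one (data_type, data) tuple
# or a list of them; a record must contain every tuple to match.
# Because of @count, the call returns the number of matches,
# not the filtered list.
@count
def filter_data(data, data_types):
    if not isinstance(data_types, list):
        data_types = [data_types]
    for data_type in data_types:
        data = list(filter(lambda student: data_type in student, data))
    return data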
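
To see how a list of `(column, value)` pairs narrows the match, here is a small self-contained sketch. The function name `count_matching` and the records in `sample` are made up for the illustration. It uses `all()` instead of repeated `filter` calls, which should give the same "every condition must hold" behavior.

```python
def count_matching(records, conditions):
    """Count records that contain every (column, value) pair in conditions."""
    # Accept a single pair as well as a list of pairs.
    if not isinstance(conditions, list):
        conditions = [conditions]
    return sum(1 for record in records
               if all(pair in record for pair in conditions))

# Hypothetical records in the grouped (column, value) format.
sample = [
    [('Gender', '1'), ('exercise', '1')],
    [('Gender', '1'), ('exercise', '2')],
    [('Gender', '2'), ('exercise', '1')],
]

print(count_matching(sample, ('Gender', '1')))                       # 2
print(count_matching(sample, [('Gender', '1'), ('exercise', '1')]))  # 1
```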

## Identify gender differences in the data
Count the number of females and males in the grouped data.

Gender
1. female
2. male

In [5]:
def get_females(data):
    return filter_data(data, ('Gender', '1'))


def get_males(data):
    return filter_data(data, ('Gender', '2'))

print('Number of females: ', get_females(grouped_data))
print('Number of males: ', get_males(grouped_data))

Number of females:  76
Number of males:  49


## Collect drinking statistics
How many people associate 'drink' with orange juice and soda, respectively?

drink: Which picture do you associate with the word drink?
1. Orange juice
2. Soda

In [6]:
def drink_stats(data):
    return {'orange_juice': filter_data(data, ('drink', '1')),
            'soda': filter_data(data, ('drink', '2'))}

print('Drink preferences: ', drink_stats(grouped_data))

Drink preferences:  {'orange_juice': 54, 'soda': 69}
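
The two counts add up to 123, so two of the 125 records hold something other than code 1 or 2 in the `drink` column. One way to see every value that occurs is to tally the whole column with `collections.Counter`. The sketch below assumes the grouped `(column, value)` format; the helper name `tally` and the sample records are hypothetical.

```python
from collections import Counter

def tally(records, column):
    """Count how often each value of column occurs in grouped records."""
    return Counter(value
                   for record in records
                   for name, value in record
                   if name == column)

# Hypothetical records in the grouped (column, value) format.
sample = [
    [('Gender', '1'), ('drink', '1')],
    [('Gender', '2'), ('drink', '2')],
    [('Gender', '1'), ('drink', '2')],
    [('Gender', '2'), ('drink', 'nan')],
]

drink_counts = tally(sample, 'drink')
print(dict(drink_counts))
```

Unlike a fixed pair of filters, the tally also reports unexpected values, which helps when deciding how to clean a column.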


## Collect exercise statistics
For each exercise pattern, how many females and how many males report it?

exercise: How often do you exercise in a regular week?
1. Everyday
2. Twice or three times per week
3. Once a week
4. Sometimes
5. Never

In [7]:
def exercise_stats(data, exercise_pattern):
    # Reject invalid patterns instead of printing a message and carrying on.
    if exercise_pattern not in range(1, 6):
        raise ValueError("exercise pattern has to be a number between 1 and 5")
    return {'females': filter_data(data, 
                                   [('Gender', '1'), 
                                    ('exercise', str(exercise_pattern))]),
            'males': filter_data(data, 
                                   [('Gender', '2'), 
                                    ('exercise', str(exercise_pattern))])}

for n_days in range(1, 6):
    print("exercise ", n_days, ": ", exercise_stats(grouped_data, n_days))

exercise  1 :  {'females': 29, 'males': 28}
exercise  2 :  {'females': 29, 'males': 15}
exercise  3 :  {'females': 7, 'males': 4}
exercise  4 :  {'females': 0, 'males': 0}
exercise  5 :  {'females': 0, 'males': 0}
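
`exercise_stats` scans the data twice for every pattern. As an alternative sketch, a single pass can count every `(gender, exercise)` combination at once. The function name `cross_tally` and the sample records are assumptions made for this illustration.

```python
from collections import Counter

def cross_tally(records, first, second):
    """Count each combination of values for two columns in one pass."""
    pairs = Counter()
    for record in records:
        values = dict(record)          # look up columns by name
        pairs[(values[first], values[second])] += 1
    return pairs

# Hypothetical records in the grouped (column, value) format.
sample = [
    [('Gender', '1'), ('exercise', '1')],
    [('Gender', '1'), ('exercise', '1')],
    [('Gender', '2'), ('exercise', '2')],
    [('Gender', '1'), ('exercise', 'nan')],
]

counts = cross_tally(sample, 'Gender', 'exercise')
print(counts[('1', '1')])   # 2
print(counts[('2', '3')])   # 0, a Counter returns 0 for missing keys
```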


## Association of fries with McDonald's fries
How many students associate the word fries with McDonald's fries?

fries: Which picture do you associate with the word fries?
1. McDonald’s fries
2. Homefries

In [8]:
def mc_donalds_fries(data):
    # Count the students who picked code 1 (McDonald's fries).
    return filter_data(data, ('fries', '1'))

mc_donalds_fries(grouped_data)

114

## Students maintaining academics and jobs
Which students have a GPA higher than 3 and report an income above $70,000 (income codes 5 and 6)?

* GPA
* income:
  * 1 - less than \$15,000 USD
  * 2 - \$15,001 to \$30,000
  * 3 - \$30,001 to \$50,000
  * 4 - \$50,001 to \$70,000
  * 5 - \$70,001 to \$100,000
  * 6 - Higher than \$100,000

In [9]:
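def bright_students(data):
    def is_bright(student):
        gpa, income = student[0][1], student[5][1]
        # Skip records whose GPA or income is not a plain number
        # (for example 'nan' or a free-text answer).
        if not (gpa.replace('.', '', 1).isdigit() and income.isdigit()):
            return False
        # Income codes 5 and 6 cover everything above $70,000.
        return float(gpa) > 3 and int(income) >= 5
    return list(filter(is_bright, data))

students = bright_students(grouped_data)
students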

[[('GPA', '3.3'),
  ('Gender', '1'),
  ('drink', '1'),
  ('exercise', '2'),
  ('fries', '1'),
  ('income', '6'),
  ('sports', '2'),
  ('weight', "I'm not answering this. ")],
 [('GPA', '3.2'),
  ('Gender', '1'),
  ('drink', '2'),
  ('exercise', '3'),
  ('fries', '2'),
  ('income', '6'),
  ('sports', '2'),
  ('weight', 'Not sure, 240')],
 [('GPA', '3.5'),
  ('Gender', '1'),
  ('drink', '2'),
  ('exercise', '1'),
  ('fries', '1'),
  ('income', '6'),
  ('sports', '1'),
  ('weight', '190')],
 [('GPA', '3.3'),
  ('Gender', '1'),
  ('drink', '2'),
  ('exercise', '2'),
  ('fries', '1'),
  ('income', '5'),
  ('sports', '2'),
  ('weight', '137')],
 [('GPA', '3.3'),
  ('Gender', '1'),
  ('drink', '1'),
  ('exercise', 'nan'),
  ('fries', '1'),
  ('income', '5'),
  ('sports', '2'),
  ('weight', '180')],
 [('GPA', '3.904'),
  ('Gender', '1'),
  ('drink', '1'),
  ('exercise', '1'),
  ('fries', '1'),
  ('income', '5'),
  ('sports', '1'),
  ('weight', '110')],
 [('GPA', '3.4'),
  ('Gender', '2'),
  ('

## Weight statistics
Which are the `x` most common weights among females, and how many females report each of them?
Which are the `x` least common weights among males, and how many males report each of them?

In [10]:
from collections import Counter

def weight_stats(data, x):
    def numeric_weights(gender_code):
        # Keep only whole-number answers for the given gender;
        # free-text and missing replies are skipped.
        return [int(student[7][1]) for student in data
                if ('Gender', gender_code) in student
                and student[7][1].isdigit()]

    female_weights = numeric_weights('1')
    male_weights = numeric_weights('2')
    print(female_weights, male_weights)
    most_common_female_weights = Counter(female_weights).most_common(x)
    # most_common() sorts by descending count, so walking the last
    # x items backwards yields the x least common weights.
    least_common_male_weights = Counter(male_weights).most_common()[:-x-1:-1]
    return {'most_common_female_weights': most_common_female_weights,
            'least_common_male_weights': least_common_male_weights}

weight_stats(grouped_data, 5)

[155, 190, 190, 137, 180, 125, 116, 110, 123, 145, 135, 105, 125, 115, 128, 150, 150, 150, 170, 150, 120, 135, 100, 170, 113, 192, 140, 155, 155, 135, 118, 180, 140, 112, 125, 145, 130, 140, 140, 120, 150, 135, 130, 190, 170, 127, 167, 140, 190, 129, 135, 155, 165, 125, 160, 135, 130, 230, 125, 130, 165, 128, 200, 160, 129, 170, 170, 113, 140, 156, 180, 120, 135] [187, 180, 264, 185, 180, 170, 165, 175, 195, 185, 185, 160, 175, 180, 167, 205, 175, 140, 168, 145, 155, 150, 169, 185, 200, 265, 165, 175, 210, 140, 200, 200, 145, 155, 175, 260, 190, 165, 175, 184, 210, 185, 170, 138, 150, 185, 135]


{'least_common_male_weights': [(190, 1),
  (187, 1),
  (184, 1),
  (169, 1),
  (168, 1)],
 'most_common_female_weights': [(135, 7),
  (140, 6),
  (150, 5),
  (170, 5),
  (125, 5)]}
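
The slice `[:-x-1:-1]` is easy to misread, so here is a small worked example on made-up weights (they are not taken from the dataset). `most_common()` without an argument returns every `(value, count)` pair sorted by descending count, and the negative-step slice walks the last `x` of them backwards.

```python
from collections import Counter

# Hypothetical weights, not taken from the dataset.
weights = [135, 135, 135, 140, 140, 150, 160]
x = 2

ranked = Counter(weights).most_common()   # sorted by descending count
most_common = ranked[:x]                  # same result as most_common(x)
least_common = ranked[:-x - 1:-1]         # last x items, walked backwards

print(most_common)    # [(135, 3), (140, 2)]
print(least_common)   # [(160, 1), (150, 1)]
```

Values with equal counts keep the order in which they were first seen, so when many weights occur only once, which of them land in the "least common" list depends on their position in the data.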

## Clean the dataset
Remove every record that has at least one unanswered, unclear, or offensive response in any of the required columns.

* GPA
* Gender
  * 1 - female
  * 2 - male
* drink: Which picture do you associate with the word drink?
  * 1 - Orange juice
  * 2 - Soda
* exercise: How often do you exercise in a regular week?
  * 1 - Everyday
  * 2 - Twice or three times per week
  * 3 - Once a week
  * 4 - Sometimes
  * 5 - Never
* fries: Which picture do you associate with the word fries?
  * 1 - McDonald’s fries
  * 2 - Homefries
* income:
  * 1 - less than \$15,000 USD
  * 2 - \$15,001 to \$30,000
  * 3 - \$30,001 to \$50,000
  * 4 - \$50,001 to \$70,000
  * 5 - \$70,001 to \$100,000
  * 6 - Higher than \$100,000
* sports: Do you do any sports activity?
  * 1 - Yes
  * 2 - No
* weight: An open-ended question - What is your weight in pounds?

In [11]:
filtered_data[0]

['GPA', 'Gender', 'drink', 'exercise', 'fries', 'income', 'sports', 'weight']

In [12]:
def clean_gpa(f):
    def wrapper(data):
        data = f(data)
        cleaned_data = [data[0]]
        cleaned_data.extend(list(filter(lambda student: 
                                        (student[0].replace('.', '').isdecimal()), data)))
        return cleaned_data
    return wrapper

def clean_gender(f):
    def wrapper(data):
        data = f(data)
        cleaned_data = [data[0]]
        cleaned_data.extend(list(filter(lambda student: 
                                        student[1].isnumeric()
                                  and int(student[1]) in {1, 2}, data)))
        return cleaned_data
    return wrapper

def clean_drink(f):
    def wrapper(data):
        data = f(data)
        cleaned_data = [data[0]]
        cleaned_data.extend(list(filter(lambda student: 
                                        student[2].isnumeric()
                                  and int(student[2]) in {1, 2}, data)))
        return cleaned_data
    return wrapper

def clean_exercise(f):
    def wrapper(data):
        data = f(data)
        cleaned_data = [data[0]]
        cleaned_data.extend(list(filter(lambda student: 
                                        student[3].isnumeric()
                                  and int(student[3]) in range(1, 6), data)))
        return cleaned_data
    return wrapper

def clean_fries(f):
    def wrapper(data):
        data = f(data)
        cleaned_data = [data[0]]
        cleaned_data.extend(list(filter(lambda student: 
                                        student[4].isnumeric()
                                  and int(student[4]) in {1, 2}, data)))
        return cleaned_data
    return wrapper

def clean_income(f):
    def wrapper(data):
        data = f(data)
        cleaned_data = [data[0]]
        cleaned_data.extend(list(filter(lambda student: 
                                        student[5].isnumeric()
                                  and int(student[5]) in range(1, 7), data)))
        return cleaned_data
    return wrapper

def clean_sports(f):
    def wrapper(data):
        data = f(data)
        cleaned_data = [data[0]]
        cleaned_data.extend(list(filter(lambda student: 
                                        student[6].isnumeric()
                                  and int(student[6]) in {1, 2}, data)))
        return cleaned_data
    return wrapper

def clean_weight(f):
    def wrapper(data):
        data = f(data)
        cleaned_data = [data[0]]
        cleaned_data.extend(list(filter(lambda student: 
                                        student[7].isnumeric(), data)))
        return cleaned_data
    return wrapper
    

@clean_gpa
@clean_gender
@clean_drink
@clean_exercise
@clean_fries
@clean_income
@clean_sports
@clean_weight
def clean_data(data):
    return data

cleaned_data = clean_data(filtered_data)
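
The eight decorators above differ only in the column index and the validity check. As a possible refactoring, and only a sketch, a decorator factory can take those two pieces as parameters. The names `clean_column`, `code_in`, and `clean_sample` are hypothetical; the first two sample rows come from the filtered data shown earlier, and the last one is invented to exercise the GPA check.

```python
def clean_column(index, is_valid):
    """Build a decorator that keeps rows whose value at index passes is_valid."""
    def decorator(f):
        def wrapper(data):
            data = f(data)
            header, rows = data[0], data[1:]
            return [header] + [row for row in rows if is_valid(row[index])]
        return wrapper
    return decorator

def code_in(allowed):
    """Return a check that accepts only the given answer codes."""
    return lambda value: value in allowed

# Only three columns are cleaned here to keep the sketch short.
@clean_column(0, lambda v: v.replace('.', '', 1).isdigit())   # GPA
@clean_column(1, code_in({'1', '2'}))                         # Gender
@clean_column(7, str.isdigit)                                 # weight
def clean_sample(data):
    return data

sample = [
    ['GPA', 'Gender', 'drink', 'exercise', 'fries', 'income', 'sports', 'weight'],
    ['2.4', '2', '1', '1', '2', '5', '1', '187'],
    ['3.3', '1', '1', '2', '1', '6', '2', "I'm not answering this. "],
    ['Personal ', '1', '2', '1', '1', '4', '1', '155'],   # invented row
]

print(clean_sample(sample))
```

Only the header and the first data row survive: the second row fails the weight check and the third fails the GPA check.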

In [13]:
cleaned_data

[['GPA', 'Gender', 'drink', 'exercise', 'fries', 'income', 'sports', 'weight'],
 ['2.4', '2', '1', '1', '2', '5', '1', '187'],
 ['3.654', '1', '2', '1', '1', '4', '1', '155'],
 ['3.5', '1', '2', '1', '1', '6', '1', '190'],
 ['2.25', '1', '2', '2', '1', '1', '2', '190'],
 ['3.8', '2', '1', '1', '1', '4', '1', '180'],
 ['3.3', '1', '2', '2', '1', '5', '2', '137'],
 ['3.3', '1', '1', '1', '1', '4', '1', '125'],
 ['3.5', '1', '2', '1', '1', '3', '1', '116'],
 ['3.904', '1', '1', '1', '1', '5', '1', '110'],
 ['3.4', '2', '2', '3', '1', '5', '1', '264'],
 ['3.6', '1', '2', '2', '1', '5', '1', '123'],
 ['3.1', '2', '2', '2', '1', '5', '1', '185'],
 ['4', '1', '1', '2', '1', '1', '2', '145'],
 ['3.6', '2', '2', '1', '1', '6', '1', '170'],
 ['3.4', '1', '1', '3', '1', '5', '2', '135'],
 ['3.3', '2', '2', '1', '1', '6', '1', '175'],
 ['3.7', '2', '1', '1', '2', '6', '1', '185'],
 ['3.7', '2', '1', '3', '1', '4', '1', '185'],
 ['2.8', '1', '1', '1', '1', '6', '1', '125'],
 ['3.7', '2', '1', '1', 