## COMP5712M: Programming for Data Science 

## Coursework 1

### @author: Nijat/rthv0616/202005660

This coursework is intended to test skills in defining algorithmic functions in Python, with an emphasis on types of function for processing, transforming and classifying. Completing this set of tasks should give you a very good grounding in the fundamental programming techniques required for data science.



### **Do not import any module other than those already imported into this file**.

In this coursework, you will be using a limited set of library modules. This is to ensure you master the kinds of algorithms required rather than relying on pre-programmed functions provided by Python packages. In other words, do not add `import` statements in any code you write for this assignment. And, of course, do not add extra packages to the `import` statements that are already in the code.

### For the tasks

You are given a bare skeleton function with the function name, arguments and a dummy return value. You need to edit the functions so that for any input of the expected type, the result returned will fit the given specification.

You will be able to check that your functions are working correctly, by running a testing function defined in the file `Coursework1_tests.py`. You should download this from Minerva and put it in the same directory as `Coursework1.ipynb`.
    
**Note:** The tests carried out by the testing module provided in `Coursework1_tests.py` are similar to but **not the same** as those that will be used for the final grading. The functions you define need to satisfy the general requirements that are specified, otherwise you may lose marks when the final grading is done using other tests.
    
When answering the questions below, you can just modify the given template cell. You don't need to create a new cell for your definition. But make sure you do not alter the name of the function or the number and type of its arguments, otherwise the automatic testing/grading function will not work correctly.


### Importing the Testing Module
In order to run the function tests provided for you to check your code,
you will need to first run the following cell, which imports the
testing module for this assignment. (As mentioned above, the tests used
for final grading of this assignment will be different from the specific
tests carried out by this module.)

In [1]:
from Coursework1_tests import do_tests, tests_version

tests_version()

(('Autograder:', 3.0), ('Tests:', '21 October 2025'))

## Q1. Using a Data File of English Words (10 marks)

For this task you will deal with a file that contains nearly all common English words:

* english_words.txt

The following code initialises the global variable `ENGLISH_WORDS` to be a **set** of all the words in the file `english_words.txt`, to be used for the rest of the tasks. You will need to have that file in the same folder as this notebook file.

In [6]:
def get_english_words():
    with open("english_words.txt") as f:
        words = f.readlines()
    # Strip the trailing newline from each line and store the words in a set.
    words = {word.strip() for word in words}
    return words

ENGLISH_WORDS = get_english_words()

You are recommended to use the `ENGLISH_WORDS` variable in the following questions, whenever you are required to perform a calculation involving all the words in `english_words.txt`. This is particularly important for a function that
needs to check the list of words many times. Reading information from a file typically takes more computational time than other operations, so if data can be stored in memory (e.g. as the value of a Python variable) this will usually be a lot more efficient than reading it from a file each time it is needed. (Of course, when dealing with very large datasets, it may not be possible to store the whole dataset in memory.)
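
The "read once, then query in memory" pattern described above can be illustrated with a small sketch. Here `io.StringIO` stands in for the real file, and `load_words` and `WORDS` are hypothetical names used only for illustration; they are not part of the coursework template.

```python
import io

def load_words(handle):
    """Read one word per line from a file-like object into a set."""
    return {line.strip() for line in handle if line.strip()}

# Assumption: a three-word stand-in for english_words.txt.
fake_file = io.StringIO("python\nsecret\nhelp\n")
WORDS = load_words(fake_file)  # the "file" is read exactly once

# Each later lookup is a set membership test and touches no file.
print("python" in WORDS)  # True
print("splap" in WORDS)   # False
```

Because the data is a set, each membership test takes roughly constant time on average, however many words are stored.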

### Q1 (a) Check whether a string is an English word (5 marks)

Write a function `is_english_word(s)`, which will test if its input string `s` is an English word, according to the file english_words.txt. This file contains a large number of English words, including all common words and many very rare words. Proper names are not included, and all words are given in lower case, with one word on each line of the file.
You need to download the file of English words. Note that it is not a program file and you do not need to edit it. 


Your function `is_english_word(s)` should take any string as its argument and return a Boolean value --- i.e. `True` or `False`. More specifically your function should return `True` if _any_ of the following conditions hold:
* The input string is one of the words in `english_words.txt`.
* The input string is the same as one of the words in `english_words.txt` except that the input string starts with a capital letter (with all the other letters being small).
* The input string is the same as one of the words in `english_words.txt` except that the input string is all in capital letters.
* The input string contains only alphabetic characters and is one of the 100 most
  common words in the English language. (One would expect all such words to be listed in `english_words.txt`, but maybe you should check.)

If none of these conditions hold, your function should return `False`.

#### Examples:

| INPUT | OUTPUT |
|-------|---------|
|`"python"`| `True`|
|`"Python"`| `True`|
|`"PYTHON"`| `True`|
|`"pyThon"`| `False`|
|`"splap"` | `False`|
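
The capitalisation conditions above can be explored with Python's built-in string methods. The following is a minimal sketch, not the required solution; `casing_pattern` is a hypothetical helper that only labels the casing of a string and does not consult the word list.

```python
def casing_pattern(s):
    """Label the casing of s as 'lower', 'upper', 'capitalised' or 'mixed'."""
    if s.islower():
        return "lower"
    if s.isupper():
        return "upper"
    # s[:1] avoids an IndexError on the empty string.
    if s[:1].isupper() and s[1:].islower():
        return "capitalised"
    return "mixed"

for example in ["python", "Python", "PYTHON", "pyThon"]:
    print(example, casing_pattern(example))
```

Note that `"".islower()` is `False`, so a one-letter capital such as `"I"` is labelled `"upper"` rather than `"capitalised"`; a solution that tests `s[1:].islower()` needs a separate all-capitals check to accept it.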

In [65]:
## Modify this function definition to fulfill the given requirements.

def is_english_word(s):
    if not s.isalpha():
        return False
        
    if s in ENGLISH_WORDS:
        return True
        
    if s.lower() in ENGLISH_WORDS:
        if s[0].isupper() and s[1:].islower():
            return True
        if s.isupper():
            return True
        
    return False

In [67]:
# Run this cell to test your is_english_word function
# The testing module must have been imported (see above)
do_tests(is_english_word)

*Autograder (v3.0)*
Testing function: is_english_word
Evaluating: is_english_word("this") ...
Returned: True
Expected answer: True
CORRECT :)    1 mark
Evaluating: is_english_word("Python") ...
Returned: True
Expected answer: True
CORRECT :)    1 mark
Evaluating: is_english_word("HelP") ...
Returned: False
Expected answer: False
CORRECT :)    1 mark
Evaluating: is_english_word("Flibbertigibbet") ...
Returned: True
Expected answer: True
CORRECT :)    1 mark
Evaluating: is_english_word("Brexit") ...
Returned: False
Expected answer: False
CORRECT :)    1 mark
----------------------------------------------
Total mark for 'is_english_word' is 5 out of 5
----------------------------------------------


### Q1(b) Literate Password Checker (5 marks)

Computer systems are vulnerable to hacking if they can be
accessed by passwords based on English words.
Now that we have access to the set of English words, we can
define a tougher password checker.

An institution uses the following rules to classify the strength of passwords:

* A string is an ILLEGAL password if either:
  * it is an English word (as defined above)
  * it contains any invisible character (space, tab, newline)


* A string is a WEAK password if it is **not** ILLEGAL and, either:
  * it is _less than_ 8 characters long.
  * it is an English word followed by one or more decimal digit characters


* A string is a STRONG password if it is **not** ILLEGAL and:
  * it contains at least 12 characters
  * AND it contains at least 1 lower case letter
  * AND it contains at least 1 capital letter
  * AND it contains at least 1 numerical digit
  * AND it contains at least 1 special character (any visible ASCII
    character that is not a letter or a number)


* A string is a MEDIUM password if it is **not** an ILLEGAL password and
  is **not** a WEAK password and is **not** a STRONG password.
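
The character requirements in the STRONG rule can be checked one class at a time. The sketch below is one possible approach under an assumption: it treats every character in `string.punctuation` as "special", which matches the wording "any visible ASCII character that is not a letter or a number". The name `character_classes` is hypothetical.

```python
import string

def character_classes(password):
    """Report which character classes occur at least once in password."""
    return {
        "lower": any(c.islower() for c in password),
        "upper": any(c.isupper() for c in password),
        "digit": any(c.isdigit() for c in password),
        # Assumption: string.punctuation is the set of special characters.
        "special": any(c in string.punctuation for c in password),
    }

print(character_classes("7Kings8all9Pies!"))
print(character_classes("brandon123"))
```

A password can only be STRONG if all four values are `True` and it is at least 12 characters long.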

You need to code a function ```password_strength``` that will take a string argument and will return the 'strength' of that string as a password, according to the rules given above. So it should return one of the strings  
```"ILLEGAL"```, 
```"STRONG"```, 
```"WEAK"``` or 
```"MEDIUM"```.

You may assume that the input password consists of only ASCII characters,
that is: alphabetic letters, numerical digits, special visible characters,
spaces, tabs and newlines. The special visible characters are:
<pre>
~`!@#$%^&*(){}[]|\/:;";<>.?
</pre>


Examples:

| INPUT             | OUTPUT |
| -----             | --------|
| ```"secret"```           |	```"ILLEGAL"``` |
| ```"my secret"```           |	```"ILLEGAL"``` |
| ```"qwertyu"```           |	```"WEAK"``` |
| ```"hello123"```           |	```"WEAK"``` |
| ```"7Kings8all9Pies!"``` |	```"STRONG"``` |
| ```"brandon123"```      |	```"MEDIUM"``` |


In [56]:
## Modify this function definition to fulfill the given requirements.

def password_strength(password):
    # Returns the strength: "ILLEGAL", "WEAK", "STRONG" or "MEDIUM".

    # Raw string so the backslash is taken literally (avoids an invalid "\/" escape).
    special_chars = r'~`!@#$%^&*(){}[]|\/:;";<>.?'

    # ---------------illegal
    if is_english_word(password):
        return 'ILLEGAL'
    for c in [' ', '\t', '\n']:
        if c in password:
            return 'ILLEGAL'

    # -----------weak
    if len(password) < 8:
        return 'WEAK'

    for i in range(1, len(password)):
        if password[:i].isalpha() and password[i:].isdigit():
            if is_english_word(password[:i]):
                return 'WEAK'

    # ---------strong
    if len(password) >= 12:
        has_lower = False
        has_upper = False
        has_digit = False
        has_special = False

        for c in password:
            if c.islower():
                has_lower = True
            elif c.isupper():
                has_upper = True
            elif c.isdigit():
                has_digit = True
            elif c in special_chars:
                has_special = True

        if has_digit and has_lower and has_special and has_upper:
            return 'STRONG'

    return 'MEDIUM'
    
    #return None




In [58]:
# Run this cell to test the password_strength function
# The testing module must have been imported (see above)
do_tests(password_strength)

*Autograder (v3.0)*
Testing function: password_strength
Evaluating: password_strength("boa constrictor") ...
Returned: 'ILLEGAL'
Expected answer: 'ILLEGAL'
CORRECT :)    1 mark
Evaluating: password_strength("Secret") ...
Returned: 'ILLEGAL'
Expected answer: 'ILLEGAL'
CORRECT :)    1 mark
Evaluating: password_strength("secret99") ...
Returned: 'WEAK'
Expected answer: 'WEAK'
CORRECT :)    1 mark
Evaluating: password_strength("Secret999!") ...
Returned: 'MEDIUM'
Expected answer: 'MEDIUM'
CORRECT :)    1 mark
Evaluating: password_strength("7Kings8all9Pies!") ...
Returned: 'STRONG'
Expected answer: 'STRONG'
CORRECT :)    1 mark
------------------------------------------------
Total mark for 'password_strength' is 5 out of 5
------------------------------------------------


## Question 2: A Simple Holiday Recommendation System (8 marks)

You are asked to write functions that operate as a
simple system for recommending holiday destinations based on the amount
of money to spend and some desired features of a holiday.
These requirements will be compared to data about possible holiday
destinations in order to find those that match the cost and
feature requirements.

### Q2(a) Find available holiday features (3 marks)
Write a function `available_features` that
takes two input parameters:
 1. The maximum amount of money (represented as an integer) that someone is prepared to spend.
 2. Holiday destination datalist.<br>
    This is a dictionary of the form shown in the next code
    cell. Each key is the name of a holiday destination, and each value is a two-element list holding the cost of the holiday and a list of attributes associated with that destination.

In [107]:
HOLIDAYS_EG = { "Scarborough": [45, ["beach"]], 
                "Whitby": [60,  ["beach", "culture"]],
                "Barcelona": [320,  ["beach", "culture", "hot"]], 
                "Corfu": [300,  ["beach", "hot"]],
                "Paris": [250,  ["culture"]],
                "Rome":[300,  ["culture", "hot"]],
                "Switzerland": [450,  ["culture", "mountains"]],
                "California": [750,  ["beach", "hot", "mountains"]],      
                }      

**NOTE:** `HOLIDAYS_EG` is just an example of the holiday data structure. **Your
function should work with any similar structure**, which could have different
destinations, costs and attribute lists (which may contain other attribute
strings).

The value returned by `available_features` should be a list of all possible holiday features that
are available from any holiday destination whose cost is less than or equal to the maximum amount
specified by input parameter 1. This list should be ordered in _alphabetical order_.

Here are some examples shown as a table:


 | parameter 1 (cost) | parameter 2 (destination data) | return (feature list)|
 | --     |  --      | -- |  
 |   100  |   `HOLIDAYS_EG`    | `["beach", "culture"]` |
 |   300  |   `HOLIDAYS_EG`    | `["beach", "culture", "hot"]`  |
 
 
**NOTE:** It need not be possible to have a holiday with all the possible features within the
 price limit, even though each feature must be available for some destination within the price limit.
 This means that it's not necessary to have a destination that gives all the returned features, provided that the features can be obtained through a combination of destinations within the price.
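
The note above can be made concrete with a tiny example. The dictionary `TINY` and its destination names are invented for illustration; the sketch only shows that the returned features may come from different affordable destinations.

```python
# Hypothetical data in the same shape as HOLIDAYS_EG.
TINY = {
    "A": [50, ["beach"]],
    "B": [80, ["culture"]],
    "C": [500, ["beach", "culture", "hot"]],
}

max_cost = 100
affordable = {name for name, (cost, _) in TINY.items() if cost <= max_cost}
features = sorted({f for name in affordable for f in TINY[name][1]})

print(affordable)
print(features)  # ['beach', 'culture']
```

No single destination within the limit offers both `"beach"` and `"culture"`, yet both appear in the result because each is available from some affordable destination.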

In [110]:
## Edit this function definition to give your answer
def available_features(max_cost, holiday_data):
    # Collect features in a set so duplicates are dropped automatically.
    features = set()
    for cost, attrs in holiday_data.values():
        if cost <= max_cost:
            features.update(attrs)

    # sorted() returns a new list in alphabetical order.
    return sorted(features)

In [112]:
# Run this cell to test the available_features function
# The testing module must have been imported (see above)
do_tests(available_features)

*Autograder (v3.0)*
Testing function: available_features
Evaluating: available_features(30, HOL_TEST) ...
Returned: []
Expected answer: []
CORRECT :)    1 mark
Evaluating: available_features(100, HOL_TEST) ...
Returned: ['beach', 'culture']
Expected answer: ['beach', 'culture']
CORRECT :)    1 mark
Evaluating: available_features(5000, HOL_TEST) ...
Returned: ['beach', 'culture', 'hot', 'wildlife']
Expected answer: ['beach', 'culture', 'hot', 'wildlife']
CORRECT :)    1 mark
-------------------------------------------------
Total mark for 'available_features' is 3 out of 3
-------------------------------------------------


In [105]:
# extra test
HOLIDAYS_TEST = { "Brighton": [150,  ["beach", "culture"]], 
                  "Whitby": [100,  ["beach", "culture"]],
                  "Barcelona": [320,  ["beach", "culture", "hot"]],
                  "Doncaster": [40,  []],
                  "Crete": [300,  ["beach", "hot"]],
                  "London": [250,  ["culture"]],
                  "Sicily": [300,  ["culture", "hot", "beach"]],
                  "Barbados": [1250,  ["hot", "beach"]],
                  "Tanzania": [2500, ["hot", "beach", "wildlife"]],
                  "Galapagos Islands": [4500,  ["beach", "wildlife"]],      
                }

print(available_features(100, HOLIDAYS_TEST))

['beach', 'culture']


In [93]:
HOL_FINAL = {
                  "Brighton": [200,  ["beach", "culture"]], 
                  "Whitby": [100,  ["beach", "culture"]],
                  "Barcelona": [320,  ["beach", "culture", "hot"]],
                  "Doncaster": [40,  []],
                  "Scotland": [250,  ["mountains", "midges"]],
                  "Crete": [300,  ["beach", "hot"]],
                  "Paris": [250,  ["culture", "food"]],
                  "Rome": [325,  ["hot", "culture", "food"]],
                  "Sicily": [350,  ["culture", "hot", "beach", "food"]],
                  # Duplicate key: this entry replaces the earlier "Crete" value.
                  "Crete": [400,  ["culture", "hot", "beach"]],
                  "Trinidad": [1000,  ["beach", "hot"]],
                  "Barbados": [1250,  ["hot", "beach"]],
                  "Tanzania": [2500,  ["hot", "beach", "wildlife"]],
                  "Galapagos Islands": [4500,  ["beach", "wildlife"]],  
                  "Switzerland": [1500,  ["mountains", "culture"]],  
                  "Antarctica": [3250,  ["wildlife"]]
             }

### Q2(b) Find possible holiday recommendations (5 marks)
Write a function `recommend_holidays` that
takes three inputs:
 1. The maximum amount of money (represented as an integer) that someone is prepared to spend.
 2. A list of attribute strings. These can be any strings but would normally
    correspond to attributes that have been specified in the holiday destination
    data parameter.
 3. Holiday destination datalist, which is of the same form as specified for part **(a)**.
    



The value returned by `recommend_holidays` should be a list of the destinations that satisfy the requirements, in descending order of cost. A destination satisfies the requirements if its cost is less than or equal to the maximum cost and it has **all** the features in the attribute list. Each entry has the form `[name, cost, features]`, as in the examples below.

Here are some examples shown as a table:


 | max_cost | attributes | holiday_data | return|
 | -- | -- | -- | -- |  
 | 500 | `["beach", "culture"]` | `HOLIDAYS_TEST` | ```[['Barcelona', 320, ['beach', 'culture', 'hot']], ['Sicily', 300, ['culture', 'hot', 'beach']], ['Brighton', 150, ['beach', 'culture']], ['Whitby', 100, ['beach', 'culture']]]``` |
 | 200 | `["beach"]` | `HOLIDAYS_TEST` | ```[['Brighton', 150, ['beach', 'culture']], ['Whitby', 100, ['beach', 'culture']]]``` |
 | 2500 | `["wildlife"]` | `HOLIDAYS_TEST` | ```[['Tanzania', 2500, ['hot', 'beach', 'wildlife']]]``` |
 
 
Note that the second argument can be any list of strings. However, if it contains any string that is
not a feature of any destination in the destination data, the value returned will be 
the empty list, `[]`, since no destinations will match that requirement.
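
Two details of this specification are worth checking in isolation: how ties in cost are ordered, and what happens when the attribute list is empty. The sketch below uses a hypothetical list `rows` of `[name, cost]` pairs rather than the full data structure.

```python
# Hypothetical [name, cost] pairs, in their original order.
rows = [["Whitby", 100], ["Crete", 300], ["Sicily", 300], ["Barcelona", 320]]

# Python's sort is stable, even with reverse=True, so entries with equal
# cost keep their original relative order (Crete before Sicily).
ranked = sorted(rows, key=lambda r: r[1], reverse=True)
print(ranked)

# all() over an empty sequence is True, so an empty attribute list
# places no feature requirement on a destination.
no_requirements = all(attr in ["beach"] for attr in [])
print(no_requirements)  # True
```

This is consistent with the autograder output shown in this notebook, where `'Crete'` precedes `'Sicily'` at the same cost of 300.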

In [125]:
## Edit this function definition to give your answer
def recommend_holidays(max_cost, attributes, holiday_data):
    recommendations = []

    for destination, (cost, features) in holiday_data.items():
        # Keep destinations within budget that offer every requested attribute.
        if cost <= max_cost and all(attr in features for attr in attributes):
            recommendations.append([destination, cost, features])

    # Sort by cost, highest first; ties keep their original dictionary order.
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations

In [127]:
# Run this cell to test the recommend_holidays function
# The testing module must have been imported (see above)
do_tests(recommend_holidays)

*Autograder (v3.0)*
Testing function: recommend_holidays
Evaluating: recommend_holidays(200, ["beach", "hot"], HOL_TEST) ...
Returned: []
Expected answer: []
CORRECT :)    1 mark
Evaluating: recommend_holidays(200, ["beach"], HOL_TEST) ...
Returned: [['Brighton', 150, ['beach', 'culture']], ['Whitby', 100, ['beach', 'culture']]]
Expected answer: [['Brighton', 150, ['beach', 'culture']], ['Whitby', 100, ['beach', 'culture']]]
CORRECT :)    1 mark
Evaluating: recommend_holidays(500, ["beach"], HOL_TEST) ...
Returned: [['Barcelona', 320, ['beach', 'culture', 'hot']], ['Crete', 300, ['beach', 'hot']], ['Sicily', 300, ['culture', 'hot', 'beach']], ['Brighton', 150, ['beach', 'culture']], ['Whitby', 100, ['beach', 'culture']]]
Expected answer: [['Barcelona', 320, ['beach', 'culture', 'hot']], ['Crete', 300, ['beach', 'hot']], ['Sicily', 300, ['culture', 'hot', 'beach']], ['Brighton', 150, ['beach', 'culture']], ['Whitby', 100, ['beach', 'culture']]]
CORRECT :)    1 mark
Evaluating: recommend

## Question 3: CO<sub>2</sub> and Greenhouse Gas Emission (5 marks)

You are asked to write functions to extract specific data from a `CSV` file using the Pandas data analysis package for Python. The dataset contains annual CO<sub>2</sub> emissions per capita for multiple countries spanning the period 1750-2022, with differing start years by country.

Full details for each task are explained below. To complete these tasks you will need to access and filter a `DataFrame`. The following questions can be done with only a small but powerful set of `DataFrame` operations.

### Q3(a) Find all countries without country code (1 mark)

As you can see in the CSV file, some entities have no country code. You need to write a function `no_code_countries()` which returns all countries without a country code as a set.

Here is an example of returned results.

<code>
{'Africa',
 'Asia',
 'Asia (excl. China and India)',
 ...,
 ...,
 'Oceania',
 'South America',
 'Upper-middle-income countries'}
</code>
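
Missing values in a column can be located with a Boolean mask. The following sketch uses a small invented `DataFrame` named `toy`, with the same `Entity` and `Code` column names as the dataset, to show the idea; the rows themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; None marks a missing country code.
toy = pd.DataFrame({
    "Entity": ["Africa", "Africa", "France", "Asia"],
    "Code": [None, None, "FRA", None],
    "Year": [2000, 2001, 2000, 2000],
})

missing_code = toy["Code"].isna()  # Boolean Series, True where Code is missing
entities = set(toy.loc[missing_code, "Entity"])
print(entities)
```

Converting to a set removes the repeats that arise because each entity appears once per year.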

In [3]:
import pandas as pd
GHE_DF = pd.read_csv("co-emissions-per-capita.csv")

In [45]:
#GHE_DF.head
#print(GHE_DF.columns)

In [47]:
## Edit this function definition to give your answer
def no_code_countries():
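    # Select the 'Entity' values of rows whose 'Code' is missing (NaN).
    no_code_entities = GHE_DF[GHE_DF['Code'].isna()]['Entity']

    # A set removes the repeats caused by one row per year.
    return set(no_code_entities)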

In [49]:
# Run this cell to test the no_code_countries function
# The testing module must have been imported (see above)
do_tests(no_code_countries)

*Autograder (v3.0)*
Testing function: no_code_countries
Evaluating: no_code_countries() ...
Returned: {'Europe', 'Lower-middle-income countries', 'European Union (28)', 'European Union (27)', 'Oceania', 'North America', 'Upper-middle-income countries', 'Europe (excl. EU-27)', 'North America (excl. USA)', 'Africa', 'High-income countries', 'Asia', 'Low-income countries', 'South America', 'Asia (excl. China and India)', 'Europe (excl. EU-28)'}
Expected answer: {'Europe', 'European Union (28)', 'Europe (excl. EU-27)', 'Africa', 'High-income countries', 'Low-income countries', 'Europe (excl. EU-28)', 'North America (excl. USA)', 'Lower-middle-income countries', 'European Union (27)', 'Oceania', 'North America', 'Asia', 'South America', 'Asia (excl. China and India)', 'Upper-middle-income countries'}
CORRECT :)    1 mark
------------------------------------------------
Total mark for 'no_code_countries' is 1 out of 1
------------------------------------------------


### Q3(b) Find countries and years with 0 (zero) annual CO<sub>2</sub> emission (1 mark)

Write a function `missing_co2data_dataframe` that returns a dataframe of the countries and years with 0 (zero) annual CO<sub>2</sub> emission.

Here is some expected output:
<code>
      Entity	Year
74	Africa	1750
75	Africa	1760
76	Africa	1770
77	Africa	1780
78	Africa	1790
...	...	...
25336	Upper-middle-income countries	1852
25337	Upper-middle-income countries	1853
25338	Upper-middle-income countries	1854
25340	Upper-middle-income countries	1856
25341	Upper-middle-income countries	1857

2178 rows × 2 columns
</code>
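
The row labels in the expected output (74, 75, ... with gaps such as 25339) are the original index values of the matching rows, since filtering a `DataFrame` keeps each row's index. A minimal sketch with invented data, using a hypothetical `toy` frame and the dataset's column names:

```python
import pandas as pd

COL = "Annual CO2 emissions (per capita)"

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Entity": ["Africa", "Africa", "Africa", "France"],
    "Year": [1750, 1760, 1770, 1900],
    COL: [0.0, 0.3, 0.0, 1.5],
})

# Keep rows with zero emissions and only the two columns of interest.
zeros = toy.loc[toy[COL] == 0, ["Entity", "Year"]]
print(zeros)
print(list(zeros.index))  # [0, 2]
```

The index is not renumbered, which is why the labels in the filtered frame can skip values.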


In [79]:
## Edit this function definition to give your answer
def missing_co2data_dataframe():
    zero_emissions = GHE_DF[GHE_DF['Annual CO2 emissions (per capita)'] == 0]

    return zero_emissions[['Entity', 'Year']]
    
    # return None

In [81]:
# Run this cell to test the missing_co2data_dataframe function
# The testing module must have been imported (see above)
do_tests(missing_co2data_dataframe)

*Autograder (v3.0)*
Testing function: missing_co2data_dataframe
Evaluating: missing_co2data_dataframe().head(10) ...
Returned:     Entity  Year
74  Africa  1750
75  Africa  1760
76  Africa  1770
77  Africa  1780
78  Africa  1790
79  Africa  1800
80  Africa  1801
81  Africa  1802
82  Africa  1803
83  Africa  1804
Expected answer:     Entity  Year
74  Africa  1750
75  Africa  1760
76  Africa  1770
77  Africa  1780
78  Africa  1790
79  Africa  1800
80  Africa  1801
81  Africa  1802
82  Africa  1803
83  Africa  1804
CORRECT :)    1 mark
Evaluating: missing_co2data_dataframe().tail(10) ...
Returned:                               Entity  Year
25330  Upper-middle-income countries  1846
25331  Upper-middle-income countries  1847
25332  Upper-middle-income countries  1848
25333  Upper-middle-income countries  1849
25335  Upper-middle-income countries  1851
25336  Upper-middle-income countries  1852
25337  Upper-middle-income countries  1853
25338  Upper-middle-income countries  1854
25340  Uppe

### Q3(c) Find total CO<sub>2</sub> emission for a country over a year range (3 marks)

Write a function `total_emission_over_years(country, year_range)` that takes two parameters:
1. the country name, exactly as it appears in the dataset
2. the year range as a list in the format [year_start, year_end], e.g. [1950, 1960].

The function returns the total CO<sub>2</sub> emission as a float.

Here are some examples shown as a table:


 | country | year_range | return|
 | -- | -- | -- | 
 | `Africa` | `[2012, 2022]` | 11.85415807 |
 | `World` | `[2012, 2020]` | 42.8247932 |
 | `United States` | `[2012, 2022]` | 175.4723575 | 

In [103]:
## Edit this function definition to give your answer
def total_emission_over_years(country, year_range):
    start_year, end_year = year_range

    # Select the rows for this country within the inclusive year range.
    in_range = GHE_DF[(GHE_DF['Entity'] == country)
                      & (GHE_DF['Year'] >= start_year)
                      & (GHE_DF['Year'] <= end_year)]

    total = in_range['Annual CO2 emissions (per capita)'].sum()

    # Convert the NumPy scalar to a plain Python float.
    return float(total)

In [105]:
# Run this cell to test the total_emission_over_years function
# The testing module must have been imported (see above)
do_tests(total_emission_over_years)

*Autograder (v3.0)*
Testing function: total_emission_over_years
Evaluating: total_emission_over_years("Africa", [2012, 2022]) ...
Returned: 11.85415807
Expected answer: 11.85415807
CORRECT :)    1 mark
Evaluating: total_emission_over_years("World", [2012, 2020]) ...
Returned: 42.8247932
Expected answer: 42.8247932
CORRECT :)    1 mark
Evaluating: total_emission_over_years("United States",[2012, 2022]) ...
Returned: 175.4723575
Expected answer: 175.4723575
CORRECT :)    1 mark
--------------------------------------------------------
Total mark for 'total_emission_over_years' is 3 out of 3
--------------------------------------------------------


# Submission

* Coursework should be submitted via Gradescope.
* You should submit your edited version of this file.
* **Do not change the name of the file. 
  The file you upload must have the same name i.e. Coursework1.ipynb**