## Department of Computer Science, University of York
### DATA: Introduction to Data Science

## Task 1: Domain Analysis (5 marks)

Given the business domain and the data overview presented in the assessment paper, provide a brief description of:

* the business problem and its significance to the relevant sector;
* the link between the business problem and the field of data science;
* the main areas of investigation; and
* potential ideas and solutions.


**Word Limit:** 300 words

**Write your answer here (text cell(s) to be used, as appropriate)**

In [None]:
### Write your answer here (code cell(s) to be used, as appropriate)



----
----


## Task 2: Database Design (25 marks)


Having understood the business domain, present a conceptual design in the form of an entity-relationship (ER) model that would be helpful in creating a database for the bank.

The bank data currently exists as a CSV file called *BankRecords.csv*, provided on the VLE (the path is given on page 5 of the assessment paper). This file holds all the existing records, and the table it contains is unnormalised. Its columns are described in Tables 1 and 2 of the assessment paper.

Following the standard principles of database normalisation, normalise the given table (*BankRecords.csv*) to a database schema that has minimum redundancies. Then, using the designed schema, create an SQLite database.

Your answer should include the SQL statements needed to accomplish this step. Your submission should also include the created SQLite database file.
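
As an illustration of the normalise-then-load step, the sketch below splits a flat record set into two related tables. It is a minimal example under stated assumptions: the column names `account_id`, `frequency`, `loan_id` and `loan_amount` and the three sample rows are hypothetical stand-ins, since the real columns are defined in Tables 1 and 2 of the assessment paper.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for a few rows of the unnormalised file; the real
# column names are defined in the assessment paper and may differ.
flat_csv = io.StringIO(
    "account_id,frequency,loan_id,loan_amount\n"
    "1,monthly,10,5000\n"
    "1,monthly,11,1200\n"
    "2,weekly,12,800\n"
)

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
con.executescript("""
CREATE TABLE Account (
    accountID INTEGER PRIMARY KEY,
    statementFrequency TEXT
);
CREATE TABLE Loan (
    loanID INTEGER PRIMARY KEY,
    accountID INTEGER,
    loanAmount REAL,
    FOREIGN KEY (accountID) REFERENCES Account(accountID)
);
""")

for row in csv.DictReader(flat_csv):
    # INSERT OR IGNORE skips an account that an earlier row already inserted,
    # so the repeated account data in the flat file is stored only once.
    con.execute(
        "INSERT OR IGNORE INTO Account (accountID, statementFrequency) VALUES (?, ?)",
        (int(row["account_id"]), row["frequency"]),
    )
    con.execute(
        "INSERT INTO Loan (loanID, accountID, loanAmount) VALUES (?, ?, ?)",
        (int(row["loan_id"]), int(row["account_id"]), float(row["loan_amount"])),
    )
con.commit()

n_accounts = con.execute("SELECT COUNT(*) FROM Account").fetchone()[0]
n_loans = con.execute("SELECT COUNT(*) FROM Loan").fetchone()[0]
print(n_accounts, n_loans)
```

The three flat rows yield two accounts and three loans: the redundancy in the account columns disappears once they live in their own table.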

Your answer should clearly cover the following:
* Any assumptions you are making about the given scenario;
* The designated keys, existing relationships, and identified functional dependencies;
* The steps followed and justifications for the decisions made.
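
A candidate functional dependency can be tested against the data before it is used to justify a decomposition. The sketch below is one possible check, not a prescribed method: `holds_fd` is a hypothetical helper, and the column names and rows are invented for illustration.

```python
from collections import defaultdict


def holds_fd(rows, lhs, rhs):
    """Return True if the dependency lhs -> rhs holds in rows (a list of dicts)."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        seen[key].add(tuple(row[col] for col in rhs))
    # The dependency holds when every lhs value maps to exactly one rhs value.
    return all(len(values) == 1 for values in seen.values())


# Invented sample rows; the real columns come from the assessment paper.
rows = [
    {"account_id": 1, "frequency": "monthly", "loan_id": 10},
    {"account_id": 1, "frequency": "monthly", "loan_id": 11},
    {"account_id": 2, "frequency": "weekly", "loan_id": 12},
]

fd_account_frequency = holds_fd(rows, ["account_id"], ["frequency"])
fd_account_loan = holds_fd(rows, ["account_id"], ["loan_id"])
print(fd_account_frequency, fd_account_loan)
```

A dependency that holds in the sample is only evidence: whether it is a genuine business rule is an assumption that should be stated in the answer. A dependency that fails in the data, however, is definitely not valid.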

**Word Limit**: 500 words. This limit applies only to the explanations. There is no limit on any associated code/SQL statements or figures.

**Write your answer here (text cell(s) to be used, as appropriate)**

In [9]:
### Write your answer here (code cell(s) to be used, as appropriate)
import sqlite3

# todo not nulls
# todo constraints
# todo the following classifier fields could be made integers:
#     statementFrequency
#     loanStatus
#     transType
#     dispType
#     cardType
# todo jupyter sql highlighting

sql_create = """
CREATE TABLE Account (
    accountID INTEGER PRIMARY KEY,
    statementFrequency TEXT,  -- frequency
    creationDate TEXT
);
CREATE TABLE Loan (
    loanID INTEGER PRIMARY KEY,
    accountID INTEGER,
    loanDate TEXT,
    loanAmount REAL,
    loanDuration INTEGER,
    loanPayments INTEGER,
    loanStatus TEXT,
    FOREIGN KEY (accountID) REFERENCES Account(accountID)
);
CREATE TABLE StandingOrder (
    orderID INTEGER PRIMARY KEY,
    accountID INTEGER,
    bankTo TEXT,
    accountTo INTEGER,
    orderAmount REAL,
    paymentType TEXT,
    FOREIGN KEY (accountID) REFERENCES Account(accountID)
);
CREATE TABLE BankTransaction (  -- Transaction is a reserved word
    transID INTEGER PRIMARY KEY,
    accountID INTEGER,
    transDate TEXT,
    transType TEXT,
    operation TEXT,
    transAmount REAL,
    balance REAL,
    transDetail TEXT,
    partnerBank TEXT,
    partnerAccount INTEGER,
    FOREIGN KEY (accountID) REFERENCES Account(accountID)
);
CREATE TABLE Client (
    clientID INTEGER PRIMARY KEY,
    cityID INTEGER,  -- a1
    birthDate TEXT,  -- birth_number
    gender INTEGER,  -- birth_number
    FOREIGN KEY (cityID) REFERENCES City(cityID)
);
CREATE TABLE Disposition (
    dispID INTEGER PRIMARY KEY,
    accountID INTEGER,
    clientID INTEGER,
    dispType TEXT,
    FOREIGN KEY (accountID) REFERENCES Account(accountID),
    FOREIGN KEY (clientID) REFERENCES Client(clientID)
);
CREATE TABLE CreditCard (
    cardID INTEGER PRIMARY KEY,
    dispID INTEGER,
    cardType TEXT,
    cardIssued TEXT,
    FOREIGN KEY (dispID) REFERENCES Disposition(dispID)
);
CREATE TABLE City (
    cityID INTEGER PRIMARY KEY,  -- a1
    cityName TEXT,
    region TEXT,
    inhabitants INTEGER,
    muns5 INTEGER,
    muns6 INTEGER,
    muns7 INTEGER,
    muns8 INTEGER,
    noAreas INTEGER,
    ratioUrban REAL,
    avgSalary REAL,
    unemployment1995 REAL,
    unemployment1996 REAL,
    entrepeneurs INTEGER,
    crimes1995 INTEGER,
    crimes1996 INTEGER  -- a16
);
"""

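sqlf = "BankRecords.db"
con = sqlite3.connect(sqlf)
try:
    # executescript raises sqlite3.OperationalError if the tables already
    # exist, so delete BankRecords.db before re-running this cell.
    con.executescript(sql_create)
    con.commit()
finally:
    con.close()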
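
SQLite does not enforce foreign keys unless `PRAGMA foreign_keys = ON` is issued on each connection. The sketch below shows the effect on a cut-down version of the `Account` and `Loan` tables. The in-memory database and the sample IDs are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Foreign-key enforcement is off by default and must be enabled per connection.
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE Account (accountID INTEGER PRIMARY KEY);
CREATE TABLE Loan (
    loanID INTEGER PRIMARY KEY,
    accountID INTEGER,
    FOREIGN KEY (accountID) REFERENCES Account(accountID)
);
""")

con.execute("INSERT INTO Account (accountID) VALUES (1)")
con.execute("INSERT INTO Loan (loanID, accountID) VALUES (10, 1)")  # parent row exists

try:
    # Account 999 does not exist, so this insert violates the foreign key.
    con.execute("INSERT INTO Loan (loanID, accountID) VALUES (11, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
con.close()
```

Without the pragma, the second `Loan` insert would succeed silently and leave an orphaned row, so the setting is worth enabling before the tables are populated.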

----
----


## Task 3: Research Design (25 Marks)

Using the database designed in Task 2, design and implement **five** potential modelling solutions to achieve the aim of the Data Intelligence team. Provide clear justifications for the techniques selected in the context of the problem at hand. Your design must combine inferential statistics, supervised learning algorithms, and unsupervised learning algorithms, and include **at least one** technique of each type. Finally, your modelling solutions should be of sufficient complexity, combining information from multiple tables of the database built in Task 2, as appropriate. Your answer should clearly show the queries made to the database. If you amend the database, clearly include the commands used in your answer.
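
A query that combines several tables might look like the sketch below, which links each loan to the client who owns the account. The table and column names follow the Task 2 schema. The rows and the codes `'OWNER'`, `'USER'`, `'A'` and `'B'` are assumed values for illustration and may not match the real data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Account (accountID INTEGER PRIMARY KEY, statementFrequency TEXT);
CREATE TABLE Client (clientID INTEGER PRIMARY KEY, gender INTEGER);
CREATE TABLE Disposition (
    dispID INTEGER PRIMARY KEY, accountID INTEGER, clientID INTEGER, dispType TEXT
);
CREATE TABLE Loan (
    loanID INTEGER PRIMARY KEY, accountID INTEGER, loanAmount REAL, loanStatus TEXT
);
-- Illustrative rows only; the codes 'OWNER', 'USER', 'A' and 'B' are assumed.
INSERT INTO Account VALUES (1, 'monthly'), (2, 'weekly');
INSERT INTO Client VALUES (100, 0), (101, 1), (102, 1);
INSERT INTO Disposition VALUES
    (1, 1, 100, 'OWNER'), (2, 1, 101, 'USER'), (3, 2, 102, 'OWNER');
INSERT INTO Loan VALUES (10, 1, 5000.0, 'A'), (11, 2, 800.0, 'B');
""")

# One row per loan, joined to the account and to the account owner only.
query = """
SELECT l.loanID, l.loanAmount, l.loanStatus,
       a.statementFrequency, c.clientID, c.gender
FROM Loan AS l
JOIN Account AS a ON a.accountID = l.accountID
JOIN Disposition AS d ON d.accountID = a.accountID AND d.dispType = 'OWNER'
JOIN Client AS c ON c.clientID = d.clientID
ORDER BY l.loanID
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
con.close()
```

Filtering on the disposition type keeps one row per loan. Without that filter, an account with several dispositions would duplicate its loans in the result and bias any model trained on it.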

Your answer should clearly cover the following:
* Any assumptions you are making about the given scenario;
* Any data processing and data integrity steps you would undertake to make the data fit for purpose;
* Which technique(s) you would apply for each solution and why;
* An evaluation of the techniques applied in terms of the accuracy of their results (or any other suitable evaluation measure);
* Algorithmic parameters should be adequately stated and discussed;
* A discussion of ethical considerations arising from the solutions selected.
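
When choosing an evaluation measure, note that accuracy alone can mislead if one class is rare, as may be the case for loan defaults. The sketch below computes accuracy, precision and recall from a confusion matrix in plain Python. The labels are invented, and `confusion_counts` is a hypothetical helper rather than a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Invented labels: 1 marks the rare class (e.g. a defaulted loan).
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when a class is never predicted or never present.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(accuracy, precision, recall)
```

Here the accuracy looks respectable even though only half of the positive cases are found, which is why precision and recall are often reported alongside it.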

**Word Limit**: 500 words. This limit applies only to the explanations. There is no limit on any associated code or figures.

**Write your answer here (text cell(s) to be used, as appropriate)**

In [None]:
### Write your answer here (code cell(s) to be used, as appropriate)


----
----

## Task 4: Experimental Results and Analysis (25 Marks)

Given the **five** modelling solutions implemented above, analyse, discuss and present your findings to the key stakeholders of the bank.

Your answer should clearly cover the following:
* Present your findings in a clear and concise manner;
* Discuss your results in the context of the selected solution;
* Discuss how these results can help the bank in performing customer risk assessment and establishing customer retention strategies;
* Present the limitations (if any) of your solutions in a clear and concise manner.

**Word Limit**: 500 words. This limit applies only to the explanations. There is no limit on any associated code or figures.

**Write your answer here (text cell(s) to be used, as appropriate)**

In [None]:
### Write your answer here (code cell(s) to be used, as appropriate)


----
----

## Task 5: Conclusion (10 Marks)

Given the insights derived from Tasks 1-4, provide a conclusion that clearly covers the following:
* A summary of the main points;
* A discussion of the significance of your results;
* Any recommendation(s) resulting from your analysis;
* Any overall ethical considerations arising from the data analysis of this business domain.

**Word Limit**: 300 words.

**Write your answer here (text cell(s) to be used, as appropriate)**

In [None]:
### Write your answer here (code cell(s) to be used, as appropriate)


----
----

## Overall Academic Quality (10 Marks)
10 marks are allocated for the clarity and cohesiveness of your answers (both text and code) across all tasks with appropriate, relevant and effective analysis and presentation of the results.

## Deliverables

You should submit the following to the submission point on the teaching portal:

1. the SQLite database produced in Task 2;
2. the completed Jupyter notebook (both .ipynb and HTML files) that also includes the SQL statements (Task 2), the research design and its implementation (Task 3), and the analysis and presentation of your results (Task 4);
3. any figures or diagrams that are included in your answers in the Jupyter notebook.

For each task where text is required, the word limit is stated above. Any text beyond the word limit will be disregarded during marking.