<a href="https://colab.research.google.com/github/p-tech/wbs-dm/blob/main/SQLite_DB_Exercise_GH.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**STEP 1: CREATE the SQLite database**


We need to import the sqlite3 module and create the database and tables. You'll see this follows the syntax we have used in previous weeks.

Note that we have created the student table with a primary key that is not an INTEGER.

Is this good practice?  
What are the issues and benefits of doing this?

In [None]:
import sqlite3

# Create a connection named conn. It is reused throughout the notebook so that every query runs against the same database.
conn = sqlite3.connect('student_grades.db')
cursor = conn.cursor()

# Create the student table with ID set as the primary key. Is it good practice to use a TEXT string as the primary key?
cursor.execute('''
CREATE TABLE IF NOT EXISTS student (
  ID TEXT PRIMARY KEY,
  First TEXT NOT NULL,
  Last TEXT NOT NULL
)
''')

# Create the grade table. No primary key is provided, because a student can appear many times in the table, as can a course.
cursor.execute('''
CREATE TABLE IF NOT EXISTS grade (
  ID TEXT,
  Code TEXT NOT NULL,
  Mark INTEGER NOT NULL
)
''')

# Create the course table. The primary key is again set as TEXT.
cursor.execute('''
CREATE TABLE IF NOT EXISTS course (
  Code TEXT PRIMARY KEY,
  Title TEXT NOT NULL
)
''')

# Save the changes to the database. Changes made by INSERT, UPDATE or DELETE statements are not stored permanently until commit() is called.
conn.commit()
#conn.close()
print("Database and tables created successfully!")
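To help think about the TEXT primary key question, here is a minimal sketch on a throwaway in-memory database, so `student_grades.db` is not touched. The names `demo_conn` and `demo_cur` and the sample rows are made up for illustration. It shows two things a TEXT key does: it keeps leading zeros in an ID, and it still rejects duplicate values.

```python
import sqlite3

# Throwaway in-memory database (hypothetical demo, separate from student_grades.db).
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute(
    "CREATE TABLE student (ID TEXT PRIMARY KEY, First TEXT NOT NULL, Last TEXT NOT NULL)"
)

# Made-up row: the leading zeros survive because the column is TEXT.
demo_cur.execute("INSERT INTO student VALUES (?, ?, ?)", ("0042", "Ada", "Lovelace"))
stored_id = demo_cur.execute("SELECT ID FROM student").fetchone()[0]
print(repr(stored_id))

# A second row with the same ID violates the primary key constraint.
try:
    demo_cur.execute("INSERT INTO student VALUES (?, ?, ?)", ("0042", "Alan", "Turing"))
    duplicate_rejected = False
except sqlite3.IntegrityError as err:
    duplicate_rejected = True
    print(f"Rejected: {err}")

demo_conn.close()
```

One trade-off to weigh: text keys are compared as strings, so '10' sorts before '9', which may or may not matter for your IDs.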


**STEP 2: Check Tables Created:**

Run the command to show the database tables created and the structure.

In [None]:
# prompt: show the table structures

#import sqlite3

#conn = sqlite3.connect('student_grades.db')
#cursor = conn.cursor()

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()

for table_name in tables:
    print(f"Table: {table_name[0]}")
    cursor.execute(f"PRAGMA table_info({table_name[0]});")
    columns = cursor.fetchall()
    for col in columns:
        print(f"  Column: {col[1]}, Type: {col[2]}, NotNull: {col[3]}, DefaultVal: {col[4]}, PrimaryKey: {col[5]}")
    print("-" * 20)

#conn.close()


**STEP 3: Upload Files:**

Run this box three times to upload the relevant csv files.

Course_Table.csv, Student_Table.csv & Grade_Table.csv

In [None]:


from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))


**STEP 4: Load CSV files into the database tables:**

This will populate the database tables with the data from the CSV files, so there is no need to write INSERT statements.

You need to make sure the correct files are loaded into the corresponding tables.

In [None]:

import csv

def import_csv_to_table(csv_file, table_name):
    # Open the file as read only ('r'), so the original CSV cannot be changed.
    with open(csv_file, 'r', encoding='utf-8') as file:
        csv_reader = csv.reader(file)
        next(csv_reader)  # Skip the header row (assumes every CSV has one)
        for row in csv_reader:
            # Create one '?' placeholder per column in the CSV row, e.g. ['?', '?', '?'].
            # join turns the list into the string '?, ?, ?' so it can go into the SQL.
            # Using '?' placeholders reduces the risk of SQL injection.
            placeholders = ', '.join(['?' for _ in row])
            # Assumes the CSV and the table have the same column order (this could be an issue).
            # You would have to specify column names if they differ.
            sql = f"INSERT INTO {table_name} VALUES ({placeholders})"
            cursor.execute(sql, row)

# Import data from the CSV files into the relevant tables, e.g. Student_Table.csv goes into the student table.
# import_csv_to_table is the function; the file name and table name are passed to it.
try:
    import_csv_to_table('Student_Table.csv', 'student')
    import_csv_to_table('Course_Table.csv', 'course')
    import_csv_to_table('Grade_Table.csv', 'grade')
    conn.commit()
    print("Data imported successfully!")
except Exception as e:
    print(f"An error occurred: {e}")
    conn.rollback()  # Rollback changes if an error occurred



Data imported successfully!
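The function above relies on the CSV columns being in the same order as the table columns. As a hedged sketch of one alternative, the example below names the columns explicitly and uses `csv.DictReader`, so values are matched by header name. The CSV text, the `demo_conn` connection and the sample rows are all made up for illustration, and it assumes the CSV headers match the table column names.

```python
import csv
import io
import sqlite3

# Hypothetical CSV whose columns are in a different order from the table.
sample_csv = io.StringIO("Last,First,ID\nLovelace,Ada,S1\nTuring,Alan,S2\n")

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE student (ID TEXT PRIMARY KEY, First TEXT NOT NULL, Last TEXT NOT NULL)"
)

# DictReader keys each row by the header names, so :ID, :First and :Last
# receive the right values whatever the column order in the file.
reader = csv.DictReader(sample_csv)
demo_conn.executemany(
    "INSERT INTO student (ID, First, Last) VALUES (:ID, :First, :Last)",
    reader,
)
demo_conn.commit()

imported_rows = demo_conn.execute(
    "SELECT ID, First, Last FROM student ORDER BY ID"
).fetchall()
print(imported_rows)
demo_conn.close()
```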


**STEP 5: Check Data has loaded**

Query each database table, load the data into a DataFrame, and display the first 5 rows.

In [None]:
import pandas as pd

# Query all three tables and load into pandas DataFrames
student_df = pd.read_sql_query("SELECT * FROM student", conn)
grade_df = pd.read_sql_query("SELECT * FROM grade", conn)
course_df = pd.read_sql_query("SELECT * FROM course", conn)

# Show the first 5 lines of each DataFrame
print("Student Table:")
print(student_df.head(5))
print("\nGrade Table:")
print(grade_df.head(5))
print("\nCourse Table:")
print(course_df.head(5))




**ONLY RUN IF YOU NEED TO DELETE THE DATA IN THE TABLES**

If you run this box, go back to **STEP 4** and re-run from there.

In [None]:
# Only run this if you need to reset the tables without deleting the database and starting again, then re-run from STEP 4.
# Delete all data from the tables
cursor.execute("DELETE FROM student")
cursor.execute("DELETE FROM grade")
cursor.execute("DELETE FROM course")

conn.commit()
print("All data deleted from the tables successfully!")



All data deleted from the tables successfully!
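A DELETE only becomes permanent once `commit()` is called. The sketch below, on a throwaway in-memory database with a made-up course row, shows an uncommitted DELETE being undone with `rollback()`. It assumes the default transaction handling of Python's `sqlite3` module, where a transaction is opened automatically before a DELETE.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE course (Code TEXT PRIMARY KEY, Title TEXT NOT NULL)")
demo_conn.execute("INSERT INTO course VALUES ('X1', 'Hypothetical Course')")
demo_conn.commit()  # the inserted row is now saved

# Delete everything, but do not commit yet.
demo_conn.execute("DELETE FROM course")
rows_before_rollback = demo_conn.execute("SELECT COUNT(*) FROM course").fetchone()[0]

# rollback() discards the uncommitted DELETE and the row comes back.
demo_conn.rollback()
rows_after_rollback = demo_conn.execute("SELECT COUNT(*) FROM course").fetchone()[0]

print(rows_before_rollback, rows_after_rollback)
demo_conn.close()
```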


**STEP 6: SQL Select statements**

Run the following statements. Ask yourself what each one will return before running it.




In [None]:
# Select all columns from the student table
student_df = pd.read_sql_query("SELECT * FROM student", conn)

# What should the output be?
print(student_df)



In [None]:
# Select Last from the student table
studentLast_df = pd.read_sql_query("SELECT ALL Last FROM student", conn)

#What should the output be.
print(studentLast_df)



In [None]:
# Select DISTINCT last names from the student table
studentLastUnique_df = pd.read_sql_query("SELECT DISTINCT Last FROM student", conn)

#What should the output be.   What does this tell you from the previous outputs?
print(studentLastUnique_df)
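To make the difference between ALL and DISTINCT concrete, here is a small sketch on made-up data in an in-memory database. It does not use the course CSV files; the three students are hypothetical, and two of them share a last name.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE student (ID TEXT PRIMARY KEY, First TEXT NOT NULL, Last TEXT NOT NULL)"
)
# Made-up rows: two students share the last name Smith.
demo_conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("S1", "Ann", "Smith"), ("S2", "Bob", "Jones"), ("S3", "Cat", "Smith")],
)

# ALL (the default) returns one value per row, duplicates included.
all_last = [row[0] for row in demo_conn.execute("SELECT ALL Last FROM student ORDER BY Last")]
# DISTINCT collapses repeated values into one.
distinct_last = [row[0] for row in demo_conn.execute("SELECT DISTINCT Last FROM student ORDER BY Last")]

print(all_last)
print(distinct_last)
demo_conn.close()
```

Comparing the two row counts is a quick way to tell whether a column contains repeated values.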



In [None]:
# Select DISTINCT First from the student table - modify the query
studentFirstUnique_df = pd.read_sql_query("SELECT DISTINCT <INSERT> FROM student", conn)

#What should the output be?  What does this tell you from the previous outputs?
print(studentFirstUnique_df)

**STEP 7: SELECT with WHERE**

In [None]:
# Select all rows from the grade table where the Mark is over 60
studentWhere_df = pd.read_sql_query("SELECT * FROM grade WHERE Mark > 60", conn)

#What should the output be.
print(studentWhere_df)

In [None]:
# Export the DataFrame to CSV for further analysis.
# index=False means no row numbers are exported.
# Change index to True and compare the outputs. You'll need to download the file from the Files window.

studentWhere_df.to_csv('Grades_Over_60.csv', index=False)


**TASK**

Create a statement to select all students that have passed the Finance Management course and export the result to CSV.

Use only standard SELECT statements.

How would you tackle the problem?

In [None]:
#review the structure of the grades table and the course table
print("\nGrade Table:")
print(grade_df.head(5))
print("\nCourse Table:")
print(course_df.head(5))

In [None]:
#select the course code from the course table for the Finance Management Course
studentCourse_df = pd.read_sql_query("SELECT <WHAT> FROM <TABLE> WHERE <FIELD> = '<VALUE>'", conn)

print(studentCourse_df)

In [None]:
#select student ID and Mark for the Finance Management Course
studentCourseMark_df = pd.read_sql_query("SELECT <WHAT> FROM <TABLE> WHERE <FIELD> = '<VALUE>'", conn)

print(studentCourseMark_df)

**STEP 8: Select from multiple tables in one statement**

This can cause issues when different tables have columns with the same name.

To resolve this, qualify the column with its table name using the following syntax:
TableName.Column

In [None]:
# Select the following columns: First, Last & Mark from the student and grade tables
studentMarks_df = pd.read_sql_query("SELECT First, Last, Mark FROM student, grade WHERE (student.ID = grade.ID)", conn)

print(studentMarks_df)
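The sketch below shows what happens when a shared column name is not qualified. It uses two cut-down, made-up tables in an in-memory database (only the columns needed for the demonstration). Selecting plain `ID` is rejected by SQLite as ambiguous, while `student.ID` works.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE student (ID TEXT PRIMARY KEY, Last TEXT NOT NULL)")
demo_conn.execute("CREATE TABLE grade (ID TEXT, Mark INTEGER NOT NULL)")
demo_conn.execute("INSERT INTO student VALUES ('S1', 'Smith')")
demo_conn.execute("INSERT INTO grade VALUES ('S1', 72)")

# ID exists in both tables, so SQLite cannot tell which one is meant.
try:
    demo_conn.execute("SELECT ID, Mark FROM student, grade WHERE student.ID = grade.ID")
    error_message = None
except sqlite3.OperationalError as err:
    error_message = str(err)
print(error_message)

# Qualifying the column as TableName.Column removes the ambiguity.
qualified_rows = demo_conn.execute(
    "SELECT student.ID, Mark FROM student, grade WHERE student.ID = grade.ID"
).fetchall()
print(qualified_rows)
demo_conn.close()
```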


**TASK**

Modify the statement to get all the students that achieved a mark over 50.

In [None]:
# Select the following columns: First, Last & Mark from the student and grade tables
studentMarks_df = pd.read_sql_query("SELECT First, Last, Mark FROM student, grade WHERE (Student.ID = Grade.ID)", conn)

print(studentMarks_df)

In [None]:
# Select the following columns: First, Last & Mark from the student and grade tables
studentMarks_df = pd.read_sql_query("<INSERT QUERY>", conn)

print(studentMarks_df)

In [None]:
# Modify the query to select the Finance Management course and all students that obtained a mark between 55 and 65
studentMarks_df = pd.read_sql_query("<INSERT QUERY>", conn)

print(studentMarks_df)

**STEP 9: ORDER BY statements**

Can be set to be either ASC or DESC.  The syntax is ORDER BY added to the select statement.

In [None]:
studentMarks_df = pd.read_sql_query("SELECT * FROM grade ORDER BY Mark DESC", conn)

print (studentMarks_df)
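ORDER BY can take more than one column, each with its own ASC or DESC. The sketch below uses a made-up `results` table (hypothetical, not one of the course tables) to show rows sorted by one column ascending and then, within ties, by a second column descending.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE results (Team TEXT NOT NULL, Points INTEGER NOT NULL)")
demo_conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [("B", 10), ("A", 5), ("A", 9), ("B", 7)],
)

# Sort by Team A-Z first; rows with the same Team are then sorted by Points, highest first.
ordered_rows = demo_conn.execute(
    "SELECT Team, Points FROM results ORDER BY Team ASC, Points DESC"
).fetchall()
print(ordered_rows)
demo_conn.close()
```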

**TASK**

Modify the statement to order by course Code and then by Mark.

In [None]:
studentMarks_df = pd.read_sql_query("SELECT * FROM grade ORDER BY Mark DESC", conn)
print (studentMarks_df)

Modify the statement to order by Code and then Mark, and to obtain the student name. Only show the name and Mark for each course Code.

In [None]:
studentMarks_df = pd.read_sql_query("SELECT * FROM grade ORDER BY Mark DESC", conn)
print (studentMarks_df)

Modify the statement to select a specific course (Statistics For Python) and display the course code and title, ordered by Mark ASC.

In [None]:
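# grade.Code = course.Code links each mark to its own course. Without it, every mark would be paired with the selected course.
studentMarks_df = pd.read_sql_query("SELECT First, Last, Mark, course.Code, Title FROM grade, student, course WHERE (student.ID = grade.ID) AND (grade.Code = course.Code) AND (course.Title = 'Statistics For Python') ORDER BY Mark ASC", conn)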
print (studentMarks_df)

**STEP 10: Arithmetic and Aggregating functions**

Simple arithmetic on the columns.

When running these we make use of AS to give the resulting column a more readable name.

In [None]:
# Simple arithmetic: dividing, multiplying and adding values
studentCount_df = pd.read_sql_query("SELECT Mark, Mark/2 AS DIVIDED, Mark*2 AS DOUBLED, Mark+10 AS MODERATED FROM grade", conn)
print (studentCount_df)
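One thing to watch in the DIVIDED column: SQLite divides two integers using integer division, so the fractional part is dropped. The sketch below uses a single made-up mark in an in-memory table to compare `Mark/2` with `Mark/2.0`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE grade (Mark INTEGER NOT NULL)")
demo_conn.execute("INSERT INTO grade VALUES (65)")  # made-up mark

# Integer / integer truncates; dividing by a real number keeps the fraction.
integer_half, real_half = demo_conn.execute(
    "SELECT Mark/2 AS DIVIDED, Mark/2.0 AS DIVIDED_REAL FROM grade"
).fetchone()
print(integer_half, real_half)
demo_conn.close()
```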

In [None]:
#simple Count:  Count up the rows:
studentCount_df = pd.read_sql_query("SELECT COUNT(*) AS COUNT FROM grade", conn)
print (studentCount_df)

In [None]:
#simple SUM for the column values
studentCount_df = pd.read_sql_query("SELECT SUM(mark) AS TOTAL FROM grade", conn)
print (studentCount_df)

In [None]:
#simple MAX for the column values
studentCount_df = pd.read_sql_query("SELECT MAX(mark) AS BEST FROM grade", conn)
print (studentCount_df)

#modify to print the Lowest Mark

**TASK**

How would you calculate the RANGE of values in the Mark column?

In [None]:
studentCount_df = pd.read_sql_query("SELECT <WHAT> AS <WHAT> FROM <WHERE>", conn)
print (studentCount_df)

How would you find a specific student's marks for all modules taken, and their average mark?

The student is Laura Smith.



In [None]:
studentMarks_df = pd.read_sql_query("SELECT <WHAT> AS <WHAT> FROM <WHERE>", conn)
print (studentMarks_df)

**STEP 11: Group BY Statements**

Used when we need to pull a group of rows together and carry out an aggregation of the data.

In [None]:
studentGroup_df = pd.read_sql_query("SELECT Code, AVG(Mark) AS AVERAGE FROM grade GROUP BY Code", conn)
print (studentGroup_df)

**TASK**

Create a query to show the First and Last name of each student and their average grade across all modules.

In [None]:
studentGroup_df = pd.read_sql_query("<INSERT YOUR QUERY>", conn)
print (studentGroup_df)

#modify the query to order by AVERAGE grade DESC

We can use HAVING in place of WHERE when the condition applies to rows that have been grouped together.
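As a rough illustration of the difference, the sketch below uses a made-up `orders` table (hypothetical, unrelated to the course data). WHERE filters individual rows before they are grouped, while HAVING filters the groups after the aggregate has been calculated.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE orders (Customer TEXT NOT NULL, Amount INTEGER NOT NULL)")
demo_conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10), ("ann", 30), ("bob", 5), ("bob", 7), ("cat", 50)],
)

# HAVING: group first, then keep only customers whose average is over 15.
having_rows = demo_conn.execute(
    "SELECT Customer, AVG(Amount) AS AvgAmount FROM orders "
    "GROUP BY Customer HAVING AVG(Amount) > 15 ORDER BY Customer"
).fetchall()

# WHERE: drop single orders of 6 or less first, then group what is left.
where_rows = demo_conn.execute(
    "SELECT Customer, AVG(Amount) AS AvgAmount FROM orders "
    "WHERE Amount > 6 GROUP BY Customer ORDER BY Customer"
).fetchall()

print(having_rows)
print(where_rows)
demo_conn.close()
```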

In [None]:
# Using your query above, identify students that have an AVG of more than 65
studentGroup_df = pd.read_sql_query("<INSERT QUERY HERE>", conn)
print (studentGroup_df)

# How would you change this so it's between 52 and 58?

**STEP 12: Using JOINS**

CROSS JOIN A & B - returns all pairs of rows from A and B

NATURAL JOIN A & B - returns pairs of rows with common values in identically named columns, without duplicating those columns

INNER JOIN A & B - returns pairs of rows satisfying a condition
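To compare the three joins side by side, the sketch below builds two cut-down, made-up tables in an in-memory database and reports how many rows and columns each join returns. The helper `shape` is a hypothetical function written for this example only.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE student (ID TEXT PRIMARY KEY, Last TEXT NOT NULL)")
demo_conn.execute("CREATE TABLE grade (ID TEXT, Mark INTEGER NOT NULL)")
# Made-up data: 3 students, 3 grade rows; student S3 has no grades.
demo_conn.executemany("INSERT INTO student VALUES (?, ?)",
                      [("S1", "Smith"), ("S2", "Jones"), ("S3", "Patel")])
demo_conn.executemany("INSERT INTO grade VALUES (?, ?)",
                      [("S1", 72), ("S1", 55), ("S2", 64)])

def shape(sql):
    """Return (number of rows, number of columns) for a query."""
    cur = demo_conn.execute(sql)
    rows = cur.fetchall()
    return len(rows), len(cur.description)

# Every student paired with every grade row; both ID columns are kept.
cross_shape = shape("SELECT * FROM student CROSS JOIN grade")
# Matches on the shared column name ID and shows ID once.
natural_shape = shape("SELECT * FROM student NATURAL JOIN grade")
# Matches on the named column ID; with USING, ID is also shown once.
inner_shape = shape("SELECT * FROM student INNER JOIN grade USING (ID)")

print(cross_shape, natural_shape, inner_shape)
demo_conn.close()
```

Note that student S3 disappears from the NATURAL and INNER results because there is no matching grade row.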

In [None]:
#CROSS JOIN
studentJoin_df = pd.read_sql_query("SELECT * FROM student CROSS JOIN grade", conn)
print (studentJoin_df)

In [None]:
#NATURAL JOIN
studentJoin_df = pd.read_sql_query("SELECT * FROM student NATURAL JOIN grade", conn)
print (studentJoin_df)

In [None]:
#INNER JOIN
studentJoin_df = pd.read_sql_query("SELECT * FROM student INNER JOIN grade USING (ID)", conn)
print (studentJoin_df)