## Thesis Question: "Suburbs with higher density yield better public transport usage"

This thesis question outlines my hypothesis: Australian suburbs, such as Hornsby, Woy Woy, Sydney CBD and Epping, will achieve better public transport patronage at their train and bus stations when the surrounding area has a higher density.

### Hypothesis:
- Suburbs with higher density, meaning more people live in the same or a smaller area, will generally receive better public transport patronage, specifically on suburban trains and metros. This hypothesis is based on the reasoning that a larger population situated around a train station puts more people within walking distance of it. Furthermore, areas with higher density tend to be more walkable as services and shops are located closer to homes.

## Requirements Outline:
### Functional requirements: 
- Data Loading: In order to load certain file types and handle file errors, my program will choose how to load a file based on its file extension (e.g. txt, csv, json). This means the program only loads data in the formats it is programmed for, and reports an error for anything else.

- Data Cleaning: In order to clean my datasets, I will be using pandas. To identify missing values, the function 'df.isnull().sum()' can be used. I can also drop columns with 'df.dropna(axis=1)' (using the 'thresh' argument to set how many values a column must contain to be kept), which will allow me to cut down on unnecessary columns with too many missing values. I can also identify duplicates with the 'df.duplicated()' function, and remove them with 'df.drop_duplicates()'.

- Data Analysis: My analysis will incorporate the mean patronage of train suburbs in Australia. This can be done by collecting and totalling all monthly reports for train patronage, and dividing by the number of months to find averages, i.e. '(Column1 + Column2 + ... + ColumnX) / X'. These averages can then be linked to average suburb densities. I will not be using median or mode in my data analysis, but may focus on max and min functions.

- Data Visualisation: I plan on using matplotlib to plot the comparisons between density and train patronage in Australian suburbs. This will allow me to create a visual representation of the patterns, and come to an eventual conclusion on my thesis statement. Functions I might be using in matplotlib include 'matplotlib.pyplot.plot()', which will allow me to create the comparison lines in the program. Furthermore, by using x and y coordinates, I can create a general curve to outline the relationship between patronage and density.

- Data Reporting: The system's output will use the graph windows provided by matplotlib, as well as the text-based terminal interface in VS Code. My datasets will use the extensions txt and csv. These will mostly be for my density and train patronage charts.
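
The loading, cleaning and analysis requirements above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `max_missing_share` cut-off and the `mean_patronage` column are assumptions made for the example, and `.txt` files are assumed to be comma-separated like the `.csv` files.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Load a file into a DataFrame, choosing the reader from its extension."""
    path = Path(path)
    if not path.is_file():
        print(f"File not found: {path}")
        return None
    suffix = path.suffix.lower()
    if suffix in (".csv", ".txt"):
        # Assumption: .txt datasets are comma-separated, like the .csv ones.
        return pd.read_csv(path)
    if suffix == ".json":
        return pd.read_json(path)
    print(f"Unsupported file type: {suffix or 'no extension'}")
    return None


def clean_dataset(df, max_missing_share=0.5):
    """Report missing values, drop sparse columns, then drop duplicate rows."""
    print(df.isnull().sum())  # number of missing values in each column
    # A column is kept only if it has at least this many filled-in values.
    min_filled = max(1, int(len(df) * (1 - max_missing_share)))
    df = df.dropna(axis=1, thresh=min_filled)
    return df.drop_duplicates().reset_index(drop=True)


def add_mean_patronage(df, month_columns):
    """Add a column holding (Column1 + ... + ColumnX) / X for each row."""
    df = df.copy()
    # mean(axis=1) averages across the listed columns and skips missing values.
    df["mean_patronage"] = df[month_columns].mean(axis=1)
    return df
```

Returning `None` for a missing or unsupported file lets the calling code notify the user instead of crashing.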

### Non Functional Requirements:
- Usability: The README document requires the project title, description, table of contents, technologies used, and more. It gives the user a basic understanding of the purpose of the repository, what the user's role in the system is, and how to interpret the repository. Furthermore, in the programming aspect, it is vital to create an easy-to-use UI that has clarity and consistency.

- Reliability: To ensure that errors are handled, the program will have to contain 'else:' statements so that if an error occurs on the user's side, they are redirected to start again or notified of the error. By using if/else statements, a simple system can be created to account for all expected inputs, minimising error and inconsistency in the UI.


## Use Case:

Actor: User

Goal: To access, interact with and manipulate the given data, as well as to gain an understanding of the thesis statement and come to an opinion about it.

Preconditions:
- Datasets have been loaded into the program, and any required installations of extensions have been completed
- The user has access to the system interface

Main Flow:
1. User opens the program
2. UI opens with a home screen in which the thesis statement and hypothesis are outlined
3. A screen is given with options for which dataset to view and which medium (e.g. graph, text based, etc.) to view it in.
    a) a table
    b) a graph
    c) text based
4. System displays chosen form
5. System gives option to exit or to continue
6. If continuing, system provides option menu again
    a) a table
    b) a graph
    c) text based
7. If exiting: end program
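
One way the main flow above could be structured is a loop around a dictionary of views. This is only a sketch under assumptions: `run_menu`, the `views` dictionary and the `read`/`write` parameters are hypothetical names used for illustration, and the prompt wording is made up.

```python
def run_menu(views, read=input, write=print):
    """Show the view menu until the user chooses to exit.

    `views` maps a menu key (e.g. "a") to a (label, function) pair.
    """
    while True:
        # Step 3: list the available ways to view the data.
        for key, (label, _) in views.items():
            write(f"{key}) {label}")
        choice = read("Choose a view: ").strip().lower()
        if choice in views:
            views[choice][1]()  # step 4: display the chosen form
        else:
            # Invalid input: notify the user and show the menu again.
            write(f"'{choice}' is not an option, please try again.")
            continue
        # Step 5: offer to continue or exit.
        again = read("Continue? (y/n): ").strip().lower()
        if again != "y":
            write("Goodbye.")  # step 7: end program
            break
```

It could be called as `run_menu({"a": ("a table", show_table), "b": ("a graph", show_graph), "c": ("text based", show_text)})`, where the three `show_...` functions are placeholders for whatever displays each view. Passing `read` and `write` as parameters makes the flow testable without typing at a keyboard.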

### Post conditions:
- User has interacted with the system
- User has gained an opinion on the accuracy of the thesis statement
- User has gained information from the datasets
- Valid updates have been saved to the system
- Data remains available and updated for additional analysis (if required)


## Phase 2: Research

### Research:
- https://data.nsw.gov.au/data/dataset/?tags=train (Train patronage information NSW): This csv spreadsheet shows the train patronage in NSW suburbs, also showing how some stations may be busier than others. It shows the monthly train patronage (exit and entry), and also shows the train_ID (which may have to be dropped as a column altogether)

- https://www.vic.gov.au/transport-patronage (Train patronage information VIC): This is similar to NSW train patronage, although more work will be done on dropping columns, getting rid of missing values and loading the csv file in the UI as a '.csv'.

- https://en.wikipedia.org/wiki/Urbanization_in_Australia (Wikipedia urbanisation in Australia): This document shows the rapid urbanisation in Australia, also showing the densest suburbs in Australia using the metric 'people/sq km'. By creating a csv myself with this information, I can convert all the given values into a matplotlib graph or chart to compare the data with my train patronage. (I will have to use functions to only add suburbs with train stations to the list, as many of the suburbs have no clear transit stops.)

Research Information: According to my research, I have found that train patronage DOES have a direct link to density, and vice versa. This can be seen in information from NSW Transport, which provides a publicly available dataset of train patronage at all stations of the NSW train network (including Sydney Metro). The top stations include Town Hall, Central, Burwood, Strathfield, Wynyard, Parramatta and Chatswood. The two main reasons for them being the busiest are service and density. Service refers to the number of trains that pass through the station on an hourly basis, as well as the number of interchanges. The busiest station is Town Hall station in Sydney, which has interchanges with nearly all the lines in the entire network, so it should come as no surprise that it is the busiest. This is despite the fact that Central station has one more interchange and is the largest station in Australia by size and number of platforms.
The reason behind Town Hall's bus


### Chosen Issue:
- Suburbs in Australia are sprawling, and are not very dense on average. In fact, in a recent survey done by the ABS on population, Sydney was shown as having a population density of under 400 people/sq km. The recommended population density for a sustainable urban area, around 10 000 people/sq km, is roughly 25 times higher than Sydney's. Furthermore, according to the ABS, Sydney's area is over 12 000 sq km, making it one of the most sprawling cities in the world.
- The large majority of Sydneysiders are now forced to live out west, several kilometres from the CBD where the vast majority of jobs are located. As a result, public transport and general transport infrastructure is getting harder and harder to upgrade as more of the city moves away from the urban core.
- My program seeks to show, using graphs and charts, how this shift in population is resulting in lower population densities, and therefore in lower transit patronage and ridership. Of course, I have researched external factors and have tried to come to a conclusion as accurate as possible by excluding external factors such as businesses, TOD (transit-oriented development), and interchanges.


#### Secondary Sources:
- https://www.abs.gov.au/statistics/people/population/regional-population/latest-release
- https://thepropertytribune.com.au/market-insights/how-population-density-is-reshaping-australian-cities/
- https://www.thenewdaily.com.au/finance/property/2018/10/04/apartment-boom-high-density-suburbs


## SEE - I paragraph:
- Density directly impacts transit use in Australian suburbs, assuming similar general population, car ownership rates and interchanges.
This means that Australian suburbs that are denser than their more suburban counterparts generally perform better in transit ridership in the form of trains, metro, light rail and bus.
This can be seen, for example, in the two suburbs 'Chatswood' and 'Edmondson Park'. These two Sydney suburbs are polar opposites in terms of price and public transport options. With a density of approximately 2900 people/sq km, Edmondson Park is denser than Sydney's average, but still far below the recommended 10 000 people/sq km for an urban area. With only 1.9 million station entries in 2023, Edmondson Park was only the 65th busiest station on the Sydney network. In comparison, Chatswood, with a population density of 14 800 people/sq km (above the recommended figure), had significantly higher transit ridership, with almost 15 million entries in the same time period. This shows the importance of density when near good public transit, and its ability to reduce car ownership in Australian suburbs and improve general urban planning of the area.

- Issues for/against
While the comparison between Chatswood and Edmondson Park shows the role of density in predicting transit ridership, what it fails to account for is transit connectivity. Chatswood is far closer to the urban core and has an extra metro interchange with the M1 line, so it clearly has an advantage over Edmondson Park, which is located 40 kilometres from the CBD in a developing region at the urban fringe of Sydney that has not yet reached its full capacity.
- On the other hand, while Chatswood does have an interchange with the metro, it is safe to say that even if Edmondson Park received an interchange, assuming the current urban planning of both areas holds, Chatswood's ridership would far outcompete Edmondson Park's, mostly as a result of the population. In addition, the data used depicts the entry/exit ridership and does not account for interchanges, as those are hard to monitor and calculate. This downplays the importance of extra interchanges and focuses most of the data on the important aspect of population density.



# Data dictionaries:
## 'NSW_Train_patronage_per_station.csv'
|Field|Datatype|Format for Display|Description|Example|Validation|
|---|---|---|---|---|---|
|_id|Integer|NNNN|Identification|27|Must be a number|
|station|String|X|Station names|Barangaroo Station|Must be a string ending with "Station"|
|Entry|Integer|NNNNNNN|Number of entries to station|Town Hall Station: 1768424|Must be a number|
|Exit|Integer|NNNNNNN|Number of station exits|Chatswood Station: 1529846|Must be a number|
|Total|Integer|NNNNNNN|Total station patronage|Circular Quay: 1327691|Must equal Entry + Exit|
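
The Validation column above could be checked in code along these lines. This is a hedged sketch: `validate_row` is a hypothetical helper, each row is assumed to be a dictionary keyed by the field names in the table, and the "Total" rule is read as Total = Entry + Exit.

```python
def validate_row(row):
    """Return a list of problems found in one patronage row (empty if valid)."""
    problems = []
    numbers_ok = True
    for field in ("_id", "Entry", "Exit", "Total"):
        value = row.get(field)
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{field} must be a number")
            numbers_ok = False
    station = row.get("station")
    if not isinstance(station, str) or not station.lower().endswith("station"):
        problems.append("station must be a string ending with 'Station'")
    # Only compare the totals when all numeric fields are usable.
    if numbers_ok and row["Entry"] + row["Exit"] != row["Total"]:
        problems.append("Total must equal Entry + Exit")
    return problems


# Made-up figures purely for illustration, not real patronage data.
example = {"_id": 27, "station": "Barangaroo Station",
           "Entry": 100, "Exit": 90, "Total": 190}
print(validate_row(example))  # prints []
```

Returning a list of messages, rather than stopping at the first problem, lets the program report everything wrong with a row at once.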

## Suburban population densities
|Field|Datatype|Format for Display|Description|Example|Validation|
|---|---|---|---|---|---|
|Suburb|string|X|Identification|Hornsby|Must be a string value and be an Australian Suburb|
|station|String|X|Station names|Hornsby Station|Must be a string ending with station|
|Density|integer|NNNNNNN|Population density (per square kilometre)|14 800|must be a number|


## Code Analysis:

### SEE-I Paragraph
- My project has been articulated successfully and has shown sufficient signs of clarity, accuracy and ease of use. This means that I have been able to successfully create a project that reflects my thesis statement, which will allow the user to get a better understanding of the topic and my idea. I have utilised many programming techniques to achieve an efficient result: the program filters and processes information loaded from csv files, creates dataframes using pandas, graphs the results using matplotlib, and combines all of these features into a text-based user interface that is simple to use, minimises errors on the user's side, and successfully answers my thesis question. This project therefore works like a well-oiled machine: it is easy to use, with minimal friction between the user and the computer, reliable due to its error handling, and efficient as a result of the simplified coding that has been used to complete the project.

### Errors:
- I noticed that the program was unable to load the csv. I was able to fix this by matching the file extensions in my programming (this meant that I had to use csv instead of txt in my code). I was also unable to install pandas, but realised that I needed to create a virtual environment known as '.venv' to install it using 'pip install pandas'.
- Though my code was working fine, I had issues with a few parts of data cleaning with the train patronage information. This was due to the fact that entry and exit logs were separate from each other. This meant that I had to use the 'groupby()' function to group by the column 'Station'. This allowed me to total up the entries and exits for each station, giving a cleaner result in my pandas dataframe.
- When creating my matplotlib graph, I had issues with the formatting: when I ran the code, the screen showed too many total values. For this reason, I decided to make a plot graph rather than a line graph, as this would show much more detail and information more relevant to my thesis question, showing how the train stations' patronage grew as density grew.
- Connecting values to my thesis question was more difficult than I initially assumed. This was likely because it was difficult to see patterns in the values. With some peer reviews, however, I gained more confidence in the link between my thesis and my results, and have come to the conclusion that the study was successful.
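
The 'groupby()' fix described above might look roughly like this. The column names `Entry_Exit` and `Trips`, the function name and the sample figures are all assumptions for illustration; the real dataset's columns may be named differently.

```python
import pandas as pd

# Made-up rows in the assumed raw layout: one row per station per direction.
raw = pd.DataFrame(
    {
        "Station": ["A Station", "A Station", "B Station", "B Station"],
        "Entry_Exit": ["Entry", "Exit", "Entry", "Exit"],
        "Trips": [120, 100, 40, 45],
    }
)


def total_by_station(df):
    """Collapse the separate Entry and Exit rows into one total per station."""
    totals = df.groupby("Station", as_index=False)["Trips"].sum()
    totals = totals.rename(columns={"Trips": "Total"})
    # Busiest stations first, with a fresh 0..n-1 index.
    return totals.sort_values("Total", ascending=False).reset_index(drop=True)


totals = total_by_station(raw)
print(totals)
```

Grouping by 'Station' leaves one row per station, which makes the result easier to combine with a density table that also has one row per station.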

### Hypothesis Conclusion
#### Initial Hypothesis: "Suburbs with higher density yield better public transport usage"

## Conclusion: 
With the data that I have gathered through various csv files from NSW transport and the ABS (Australian Bureau of Statistics), I have found a strong relationship between population density and public transport usage. Referencing the dataframe that I made by combining the csv files 'Suburb_density_data.csv' and 'monthly_usage_pattern_train_data-june-2024.csv', most of the top 100 stations in the NSW network have significantly higher population densities than similar train stations with less patronage. The reverse is also true: train stations located in less dense areas have less train patronage.
My explanation for why this is true is the idea behind transit-oriented development. With density, more individuals are able to live closer to the train station, which therefore generates more patronage. On average, the acceptable walking distance to a train station is around 800 m. A radius of 800 m around a train station covers only around 2 square kilometres of valuable land. Some train stations, such as Chatswood, sit in a CBD with a density of 14 800 people/sq km, giving around 30 000 people nearby access to a train station, while some stations have far fewer, therefore generating less patronage.
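
The catchment figures above can be checked with a short calculation. This sketch assumes the walkable area is a perfect circle around the station and that the suburb's density is uniform across it; `walkable_catchment` is a name made up for the example.

```python
import math


def walkable_catchment(density_per_sq_km, radius_m=800):
    """Return (area in sq km, estimated residents) within walking distance."""
    radius_km = radius_m / 1000
    area_sq_km = math.pi * radius_km ** 2  # area of a circle: pi * r^2
    # Assumption: the density is the same everywhere inside the circle.
    return area_sq_km, area_sq_km * density_per_sq_km


area, people = walkable_catchment(14_800)
print(f"{area:.2f} sq km, about {people:,.0f} people")
# prints: 2.01 sq km, about 29,757 people
```

The result matches the rounded figures of roughly 2 square kilometres and 30 000 people used in the conclusion.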

##### Outliers:
There have been some outliers, with most of these being stations that contain one or more interchanges. With more interchanges, stations have better patronage as they are accessible from more directions, with more lines. Other outliers include Wolli Creek, a station in a suburb with a density of over 16 000 people/sq km. Due to the small, localised area of the suburb and the lack of businesses, its transit ridership is not comparable to stations like Chatswood and Town Hall, despite both having lower population densities. However, one correlation that does hold is the percentage of the population utilising public transport for their commute. Wolli Creek has the highest ridership percentage of any suburb in the city of Sydney, with almost 70% of residents choosing public transport over the automobile.

## Peer Evaluation:

|plus|minus|implication|
|---|---|---|
|ease of use, clear and concise|Could have used GUI|Clear, but boring; Could use GUI instead of text based|
|Answers thesis statement well|Does not have many options to view data with|Could add more ways to view data, but otherwise clear|
|Clearly shows data|No way to manipulate or change the data|Could allow user to edit and save data|
|Clearly documented project|Could elaborate more on issues regarding code|add more information on how you coded|
|Helpfully illustrated the connection to thesis statement|Could have shown outliers|Add a section to view outliers|
|Provided clear conclusion|Maybe required more work in introduction with thesis statement|display thesis statement at the start|

## Project Evaluation:

#### Evaluate your system and results in relation to your Requirements Outline:
- My project is able to successfully load data by recognising the type of file it is with the file extension (csv, txt etc). It is also able to use data cleaning to create an easy-to-read pandas dataframe by identifying missing values, dropping columns with the missing values, using the groupby() function to get rid of the alternating Entry/Exit column and even identify and get rid of duplicates. I also successfully used data analysis in my pandas dataframe as well as my matplotlib graph by clearly identifying and showing the connection between density and patronage in suburbs with public transport. Using matplotlib allowed me to use data visualisation as well, creating a clear way for the user to also identify the pattern, although this was also thoroughly pushed in the UI to make it easier for the user.
- In terms of non functional requirements, my program used a clear and consistent UI with ease of use. I also made sure that my repository contained a readme with all the basic information about my project, such as the table of contents and installation information.
To make sure the program was as reliable as possible, I also made sure to create if/else statements that accounted for mistakes on the user's side, creating a consistent program that ran reliably.

#### Evaluate your system in relation to peer feedback
- My project has improved substantially since I received feedback from my peers. I added my thesis statement to the start of the program to outline the question I was going to be answering, as well as important information on how to navigate the user interface. Furthermore, I added more ways for the user to view the data, making it substantially more customisable. Regarding the theory aspect of my project, I was given feedback to add more information about the issues I faced with my programming. Therefore, I added some of the issues that I faced in my programming to my evaluation.
Some feedback I was unable to act on included showing the outliers, which was difficult as I only provided information regarding the density of the top 100 stations by patronage. This made it extremely difficult to pinpoint outliers, which resulted in me scrapping the idea entirely. I also decided against letting the user manipulate or change the data, as I found that the dataset would become inconsistent and often incorrect if a user altered it.

As a whole, I believe that my final project resembled most of my initial vision while also fixing many of the issues my peers had with the initial draft of the code and the repository. I believe that I have successfully answered the question I asked in my initial thesis statement.

#### Evaluate your project in relation to project management

As this was my second attempt, I had much more confidence with using VS Code and GitHub repositories. I also found the coding aspect less challenging, although it is definitely still not my strong suit. In terms of timeline, I completed the majority of the initial theory work (parts 1 and 2) on the first day. This left most of the second day for coding (part 3). I finished coding on the 3rd day, and also finalised my evaluation. On the 4th day I edited and made sure that my project was ready for submission. This shows that I had no issues regarding time management, due to this being my second attempt. During my first attempt, however, I took over 2 weeks to complete the task. This has been a major improvement in overall quality and in project management. Although I faced some issues with my code, I was able to quickly fix them through research on websites such as W3Schools, as well as language models such as ChatGPT, which gave me an opportunity to learn more about what I was coding, and improve my planning to become more efficient. However, with the construction of the program itself, NO AI whatsoever was used for generation.

#### Evaluate your system in relation to its data and security
##### Is the data valid, accurate and timely? 
- The data is valid, timely and accurate, although the density values are generally estimates coming from the ABS. This was due to the time constraints that prevented me from manually gathering all the population densities. In addition, many stations (e.g. Town Hall and Central) were located in suburbs with different names to their stations (Haymarket and Sydney CBD).
##### Is it unbiased?
- Seeing as the data of train stations and density is from concrete sources and the topic does not involve many issues with bias, not being a subjective one, my project is not at all biased and has been completely objective and transparent with its sources.
##### Do we need to improve its security – if so, how? 
- Security could be improved by utilising a passcode. Seeing as this project does not require any protection due to the user's inability to manipulate or change the data, this is not necessary. If it was, however, a passcode and a simple if/else statement would be enough to create a layer of protection. In theory, encryption could also be utilised.
##### Could the UX be more accessible – if so, how?
- The user interface is already highly accessible, utilising an extremely basic form of text interface that uses numbers and letters as input to dictate where to proceed. If it was required to be even more accessible, however, clearly documenting and explaining how to navigate the UI in the menu would make it even easier to use. In addition, the lack of a keyboard and the documentation of any required installations should make it easy for anyone to use the program and view the project as a whole.