## KNN of Cities Most Comparable to Missoula

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
# load_iris() returns the rows sorted by class (50 per class), so slicing the
# first 100 rows would train on classes 0 and 1 only and test on class 2 only.
# A shuffled, stratified split keeps all three classes in both sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=100, stratify=y, random_state=0
)

# Create a KNN classifier with k=3 (3 nearest neighbors)
knn = KNeighborsClassifier(n_neighbors=3)

# Train the classifier on the training data
knn.fit(X_train, y_train)

# Predict the classes of the samples in the testing set
predictions = knn.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.mean(predictions == y_test)
print("Accuracy:", accuracy)


Explanation: 

KNN classifies a new data point by comparing it with stored, labeled data points. For example, given a dataset of tomatoes and bananas, KNN stores features such as shape and color for each labeled example. When a new object arrives, its color (red or yellow) and shape are compared against the stored examples, and it receives the label of the most similar ones.

In the code block above, we split the data into training and testing sets, using 100 samples for training and the remaining 50 for testing. Holding out a test set is a common way to evaluate a machine learning model. Because the iris rows are ordered by class, the split is shuffled and stratified so that all three species appear in both sets.

Next, we create a KNN classifier using the KNeighborsClassifier class from the sklearn.neighbors module. We specify that we want to use n_neighbors=3, meaning that the classifier will use the 3 nearest neighbors when making predictions.

We then train the classifier on the training data using the fit method, and use the predict method to make predictions on the samples in the testing set. Finally, we calculate the accuracy of the classifier by comparing the predicted classes to the true classes of the samples in the testing set.
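To make the prediction step concrete, here is a minimal sketch of what a 3-nearest-neighbor vote does for a single new point. It is written in plain NumPy rather than scikit-learn, and the toy points and the helper name `knn_predict` are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (illustrative only): two well-separated groups in 2-D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15])))  # point near the first group
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0])))   # point near the second group
```

KNeighborsClassifier performs the same kind of vote, but with faster neighbor search and more options (distance weighting, other metrics).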



Metrics: top categories: Budget size, Number of Parks or Sites Maintained, FTEs, Jurisdiction Population, Jurisdiction Population Per Sq Mile 

In [29]:
#trying to extract info from pdf
# extract_doc_info.py
# The original `from PyPDF2 import metadata` raised
# "ImportError: cannot import name 'metadata' from 'PyPDF2'" because PyPDF2 exports no such name.
# Assumes PyPDF2 >= 2.0, which provides PdfReader (older releases used PdfFileReader).

from PyPDF2 import PdfReader

pdf_path = '/Users/krusty/Desktop/Capstone/ProvoPNR.pdf'

def extract_information(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfReader(f)
        information = pdf.metadata  # may be None if the PDF has no info dictionary
        number_of_pages = len(pdf.pages)

        txt = f"""
        Information about {pdf_path}: 

        Author: {information.author}
        Creator: {information.creator}
        Producer: {information.producer}
        Subject: {information.subject}
        Title: {information.title}
        Number of pages: {number_of_pages}
        """

        print(txt)
    return information

if __name__ == '__main__':
    path = '/Users/krusty/Desktop/Capstone/ProvoPNR.pdf'
    extract_information(path)

In [3]:
# The original `pdf = pdf_path(f)` called the path string as a function, which raised
# "TypeError: 'str' object is not callable". Assumes PyPDF2 >= 2.0 (PdfReader API).
from PyPDF2 import PdfReader

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfReader(f)
        text = ''
        for page in pdf.pages:
            # Guard against pages that yield no text (for example scanned images)
            text += page.extract_text() or ''
        return text

pdf_file_path = '/Users/krusty/Desktop/Capstone/ProvoPNR.pdf'
text = extract_text_from_pdf(pdf_file_path)
print(text)

## NRPA Percentile Plot

Explanation: 

In this code, we first load the data from the CSV file using the pandas.read_csv() function, which reads the file and returns a DataFrame. We then compute the percentiles of a specific column using the quantile() method. Called with [0.25, 0.5, 0.75], it returns the 25th, 50th, and 75th percentiles of that column as a Series indexed by those fractions.

We then create a box plot of the column using the matplotlib.pyplot.boxplot() function. This function creates a box-and-whisker plot of the column. We add horizontal lines for the percentiles using the matplotlib.pyplot.axhline() function, which adds a horizontal line to the plot at the specified y-value.

Finally, we customize the plot by setting the plot title, x-axis label, and y-axis label using the matplotlib.pyplot.title(), matplotlib.pyplot.xlabel(), and matplotlib.pyplot.ylabel() functions, respectively.

Note that the file path passed to pd.read_csv() should be replaced with the actual path to the CSV file on your computer, and column_name stands for the name of the column you want to compute the percentiles for. Additionally, you can customize the plot further by changing the colors, linestyles, or other properties of the plot.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the data from the CSV file
data = pd.read_csv('/Users/krusty/Desktop/Capstone/NRPA Aggregate Averages Data.csv')

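# Compute the percentiles
# 'column_name' was a placeholder and raised "KeyError: 'column_name'".
# Assumption: fall back to the first numeric column; set column_name to the metric you want.
column_name = data.select_dtypes(include='number').columns[0]
values = data[column_name].dropna()
percentiles = values.quantile([0.25, 0.5, 0.75])

# Create a box plot (NaN values are dropped first so the statistics are well defined)
plt.boxplot(values)

# Add horizontal lines for the percentiles
plt.axhline(percentiles[0.25], color='r', linestyle='--')
plt.axhline(percentiles[0.5], color='g', linestyle='--')
plt.axhline(percentiles[0.75], color='b', linestyle='--')

# Customize the plot
plt.title(f'Percentiles of {column_name}')
plt.xlabel('X label')
plt.ylabel(column_name)

# Display the plot
plt.show()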
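To see what the box plot encodes, the quartiles and whisker limits can also be computed directly. The following is a small sketch on synthetic values (not NRPA data). It assumes matplotlib's default whisker rule of 1.5 times the interquartile range.

```python
import numpy as np

# Synthetic values for illustration (not NRPA data)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles with linear interpolation, the default for NumPy and pandas
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers by default
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```

Because the box already spans the 25th to 75th percentile with a line at the median, the dashed axhline() lines mostly serve to extend those levels across the full width of the plot.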

## K-Means Cluster Analysis 

In [25]:
from sklearn.cluster import KMeans
import numpy as np

# Generate sample data
np.random.seed(0)
data = np.random.rand(100, 2)

# Fit the model
kmeans = KMeans(n_clusters=3, random_state=0).fit(data)

# Predict the cluster labels for each data point
labels = kmeans.predict(data)

# View the cluster centers
cluster_centers = kmeans.cluster_centers_
print(cluster_centers)
#

[[0.53191784 0.17092862]
 [0.22477318 0.67757842]
 [0.7591731  0.6383279 ]]
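The choice of n_clusters=3 above is arbitrary. One common heuristic for picking it is to compare the within-cluster sum of squares (`inertia_`) across several values of k and look for the point where the improvement levels off. The sketch below uses the same kind of synthetic data as the cell above; the range of k values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same kind of synthetic data as in the cell above
rng = np.random.RandomState(0)
data = rng.rand(100, 2)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # inertia_ is the sum of squared distances from each point to its assigned center
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

Uniform random points have no real cluster structure, so no clear elbow should be expected here. On real agency metrics the curve may be more informative.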


#nps data: https://www.nps.gov/subjects/gisandmapping/tools-and-data.htm
#nps APIs https://www.nps.gov/subjects/developer/api-documentation.htm
#CDC EJI https://eji.cdc.gov/launcher.html


In [31]:
#Cluster analysis from NRPA Data https://nrpaparkmetrics.com/NRPA/Reports/ERReports/APRT.aspx

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load the data
data = pd.read_csv('/Users/krusty/Desktop/Capstone/NRPADataAll.csv')

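# Preprocess the data
# KMeans needs numeric input. 'Region' holds strings such as 'WI', which raised
# "ValueError: could not convert string to float: 'WI'", so it is one-hot encoded here.
# Note: 'Year' is left unscaled and will dominate the distances; scale it if that matters.
subset = data[['Region', 'Year']].dropna()
X = pd.get_dummies(subset, columns=['Region']).to_numpy(dtype=float)

# Fit the model
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

# Predict the cluster labels for each data point
labels = kmeans.predict(X)

# View the cluster centers (one column for Year, then one per region)
cluster_centers = kmeans.cluster_centers_
print(cluster_centers)

# Visualize the results, using an integer code per region on the x-axis
region_codes, region_names = pd.factorize(subset['Region'])
plt.scatter(region_codes, subset['Year'], c=labels)
plt.xticks(range(len(region_names)), region_names, rotation=90)
plt.xlabel('Region')
plt.ylabel('Year')
plt.show()

#top categories: Budget size, Number of Parks or Sites Maintained, FTEs, Jurisdiction Population, 
#Jurisdiction Population Per Sq Mile 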

To plot a geoid in Python from a CSV file, first convert the CSV data to a GeoDataFrame, then call its plot() method to draw the map. The code cell below follows these steps.

In this code, we first load the data from the CSV file using the geopandas.read_file() function, which reads the file and returns its rows as a table. We then convert the data to a GeoDataFrame using the gpd.GeoDataFrame() constructor, passing the table together with a geometry built by gpd.points_from_xy() from the Longitude and Latitude columns. This creates a GeoDataFrame with one Point geometry per row.

We then plot the geoid using the GeoDataFrame.plot() method, which draws the map. We customize the plot by setting the figure size through the figsize argument of plot(), setting the title with matplotlib.pyplot.title(), and hiding the axes with matplotlib.pyplot.axis('off').

Finally, we display the plot using the matplotlib.pyplot.show() function.

Note that the file path passed to read_file() should be replaced with the actual path to the CSV file on your computer. Additionally, you may need to adjust the longitude and latitude column names in the gpd.points_from_xy() call to match the column names in your CSV file.

In [1]:
import geopandas as gpd
import matplotlib.pyplot as plt

# Load the data from the CSV file
data = gpd.read_file('/Users/krusty/Desktop/Capstone/MTEJI22.csv')

# Convert the data to a GeoDataFrame
geodata = gpd.GeoDataFrame(data, geometry=gpd.points_from_xy(data.Longitude, data.Latitude))

# Plot the geoid
geodata.plot(figsize=(10,10))

# Customize the plot
plt.title('Geoid')
plt.axis('off')

# Display the plot
plt.show()


AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'

Note: this error comes from the Python environment rather than from the plotting code. It typically indicates that the installed pyOpenSSL is too old for the installed cryptography package, and upgrading pyOpenSSL usually resolves it.

## Analyzing Comparable City Comprehensive Plans

In [19]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        resource_manager = PDFResourceManager()
        string_io = StringIO()
        converter = TextConverter(resource_manager, string_io, codec='utf-8', laparams=LAParams())
        page_interpreter = PDFPageInterpreter(resource_manager, converter)
 
        for page in PDFPage.get_pages(file, caching=True, check_extractable=True):
            page_interpreter.process_page(page)
 
        text = string_io.getvalue()
 
        # close open handles
        converter.close()
        string_io.close()
 
        return text

if __name__ == '__main__':
    print(extract_text_from_pdf('/Users/krusty/Desktop/Capstone/BPRDPlan.pdf'))
    
   

BEND PARK & RECREATION DISTRICT
COMPREHENSIVE PLAN

Due to the nature of this document, not all pages are in an accessible format. Please contact Quinn 

Keever at Quinn@bendparksandrec.org to make an accommodation request.

ADOPTED JULY 2018 

A MESSAGE 
FROM THE 
EXECUTIVE 
DIRECTOR

Abraham Lincoln once said, “Give me six hours to 
chop down a tree and I will spend the first four hours 
sharpening the axe.” 

Planning the future of your park and recreation district 
is one of the most important things we can do to assure 
that the resources we are entrusted to manage are put 
to good use.  Bend Park and Recreation District listens 
to our residents’ needs and desires and does our best 
to achieve the community’s vision for their park and 
recreation system. Listening to our residents has helped 
us develop one of the most diverse recreation programs 
in the state, and a park system envied across the nation.  
We could not do this without your input, your support, 
and your trust.


In [24]:
pdf_text = extract_text_from_pdf('/Users/krusty/Desktop/Capstone/BPRDPlan.pdf')
word = "park"
word2 = "geospatial"

#if word in pdf_text:
    #print(f"The word '{word}' was found in the text")
#else:
    #print(f"The word '{word}' was not found in the text")

count = pdf_text.count(word)
count2 = pdf_text.count(word2)
print(f"The word '{word}' was found {count} times in the text")
print(f"The word '{word2}' was found {count2} times in the text")

#Provo: Provo found 247 times in the text and climate found 3 times 
#Bend: Bend found 179 times, climate found 0, sustainable 1 time, inclusive-1, housing-5
#park-471, geospatial-9
    
    


The word 'park' was found 471 times in the text
The word 'geospatial' was found 9 times in the text
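
Note that `str.count` counts substring matches and is case-sensitive, so "park" also matches inside "parks" and "parking" while "Park" is skipped. If whole-word, case-insensitive counts are wanted instead, a sketch along these lines may help. The helper name `count_words` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def count_words(text, words):
    """Count case-insensitive whole-word occurrences of each word in `words`."""
    # Lowercase the text and split it into alphabetic tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {word: counts[word.lower()] for word in words}

sample = "Park planning matters. The parks in Bend include one large park."

# Substring count: also matches the "park" inside "parks"
print(sample.lower().count("park"))

# Whole-word counts keep "park" and "parks" separate
print(count_words(sample, ["park", "parks"]))
```

The same function can be applied to `pdf_text` with any list of search terms, which also avoids one `count` variable per word.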


In [None]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

import requests
from io import BytesIO

def extract_text_from_pdf2(pdf_file):
    """Extract text from a binary file-like object that contains a PDF."""
    resource_manager2 = PDFResourceManager()
    string_io2 = StringIO()
    converter2 = TextConverter(resource_manager2, string_io2, codec='utf-8', laparams=LAParams())
    page_interpreter2 = PDFPageInterpreter(resource_manager2, converter2)

    for page in PDFPage.get_pages(pdf_file, caching=True, check_extractable=True):
        page_interpreter2.process_page(page)

    text = string_io2.getvalue()

    # close open handles
    converter2.close()
    string_io2.close()

    return text

if __name__ == '__main__':
    # open() cannot read a URL, so download the PDF into memory first
    url = 'https://www.bendparksandrec.org/wp-content/uploads/2018/07/BPRD-Comp-Plan-Adopted-for-web.pdf'
    response = requests.get(url)
    response.raise_for_status()
    print(extract_text_from_pdf2(BytesIO(response.content)))
    
   

## Building Website Request to Scrape Similar Metrics

In [None]:
import requests
from bs4 import BeautifulSoup

In [3]:
#potential: Bend, OR; Provo, UT; Bozeman, MT; Billings, MT

In [4]:
# Make a request to the website
# Note: this URL points to a PDF, not an HTML page, so BeautifulSoup cannot parse it
# meaningfully (hence the decoding warning below). Replace it with the URL of an
# HTML page that contains the metrics table.
url = "https://www.bendparksandrec.org/wp-content/uploads/2018/07/BPRD-Comp-Plan-Adopted-for-web.pdf"
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the table containing the metrics
    # ("metrics-table" is a placeholder id; adjust it to the target page)
    table = soup.find("table", {"id": "metrics-table"})

    # Extract the data from the table (find() returns None if no such table exists)
    rows = table.find_all("tr") if table is not None else []
    metrics = {}
    for row in rows:
        cols = row.find_all("td")
        if len(cols) == 2:
            metric_name = cols[0].text.strip()
            metric_value = cols[1].text.strip()
            metrics[metric_name] = metric_value

    # Print the metrics
    print(metrics)
else:
    print("Request failed with status code:", response.status_code)


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
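
The table-extraction logic can be tried without any network request by running it on an inline HTML string. In the sketch below, the HTML, the `metrics-table` id, and the metric names and values are all hypothetical stand-ins for a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a real page; the id and values are placeholders
html = """
<table id="metrics-table">
  <tr><th>Metric</th><th>Value</th></tr>
  <tr><td>Metric A</td><td> 12 </td></tr>
  <tr><td>Metric B</td><td>3.5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "metrics-table"})

metrics = {}
if table is not None:
    for row in table.find_all("tr"):
        cols = row.find_all("td")
        # Header rows use <th>, so they have no <td> cells and are skipped
        if len(cols) == 2:
            metrics[cols[0].get_text(strip=True)] = cols[1].get_text(strip=True)

print(metrics)
```

The extracted values are strings. They would need to be converted with float() or pd.to_numeric() before being used in a model.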

## Code for Regression Analysis

In [32]:
# Import the necessary libraries
import pandas as pd
import statsmodels.formula.api as smf

# Read in the data
parks_data = pd.read_csv("/Users/krusty/Desktop/Capstone/NRPADataAll.csv")

# Check the structure of the data
print(parks_data.info())

# Perform the regression analysis
model = smf.ols("num_visitors ~ park_size + num_facilities + maintenance_budget", data=parks_data).fit()

# Check the summary of the model
print(model.summary())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7947 entries, 0 to 7946
Columns: 245 entries, Agency to Have hiring practices and policies that promote a diverse agency workforce?
dtypes: float64(38), int64(1), object(206)
memory usage: 14.9+ MB
None


PatsyError: Error evaluating factor: NameError: name 'num_facilities' is not defined
    num_visitors ~ park_size + num_facilities + maintenance_budget
                               ^^^^^^^^^^^^^^

Code Explanation: 

The smf.ols() function in Python's statsmodels package is used to perform Ordinary Least Squares (OLS) regression analysis.

In the code smf.ols("num_visitors ~ park_size + num_facilities + maintenance_budget", data=parks_data), we are specifying the regression formula. The formula specifies that the dependent variable num_visitors is regressed on the independent variables park_size, num_facilities, and maintenance_budget. The ~ sign in the formula separates the left-hand side (dependent variable) from the right-hand side (independent variables), and the + sign is used to separate the independent variables.

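The data parameter specifies the dataset to use in the regression analysis. In this case, parks_data is the DataFrame that contains the parks and recreation metrics. Every variable named in the formula must be a column of that DataFrame; the PatsyError above is raised because placeholder names such as num_facilities are not columns of NRPADataAll.csv. Column names that contain spaces or punctuation have to be wrapped as Q("column name") inside the formula.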

The fit() method is used to fit the regression model to the data. Once the model has been fit, the resulting object model contains information about the regression coefficients, standard errors, t-statistics, and p-values, among other things.

Taken together, model = smf.ols("num_visitors ~ park_size + num_facilities + maintenance_budget", data=parks_data).fit() builds the linear regression of num_visitors on park_size, num_facilities, and maintenance_budget, fits it to parks_data, and stores the fitted results in model. Calling model.summary() then prints the coefficient table.
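
To illustrate what the fitting step computes, the sketch below solves an ordinary least squares problem with plain NumPy instead of statsmodels. The data are synthetic, and the variable names `x1`, `x2`, and `y` are placeholders rather than columns of the NRPA file.

```python
import numpy as np

# Synthetic data for illustration: y = 2 + 3*x1 - 1*x2 plus small noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 2.0 + 3.0 * x1 - 1.0 * x2 + rng.normal(scale=0.1, size=200)

# Design matrix with an intercept column, matching the formula "y ~ x1 + x2"
X = np.column_stack([np.ones_like(x1), x1, x2])

# The least-squares solution minimizes the sum of squared residuals
coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefficients)  # intercept, slope for x1, slope for x2
```

The estimates should land close to the true values 2, 3, and -1. statsmodels produces the same kind of coefficients and adds standard errors, t-statistics, and p-values on top.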