# MAT 381, Lecture 12 (Review)


## Random Variables

All data are sampled from random variables, which are variables whose possible values are outcomes of a random event. They are used to model random phenomena, and can be either discrete or continuous.

Discrete random variables can only take on a specific set of values. For example, the number of heads obtained when a coin is flipped twice is a discrete random variable, as it can only take on the values 0, 1, or 2. A continuous random variable, on the other hand, can take on any value within a certain range. For example, the height of a person is a continuous random variable, as it can take on any value within an interval (e.g. between 1 foot and 8 feet).

Here are several common tasks that we performed with datasets sampled from random variables:


* Expected value is a measure of the central tendency of the variable. For a discrete variable it is calculated as the sum of the products of each possible value and its probability; for a continuous variable the sum becomes an integral against the probability density. The expected value can be thought of as the long-run average value of the variable.

* Variance of a random variable is a measure of its spread or dispersion. It is the expected value of the squared difference between the variable and its expected value: the sum of the squared differences between each possible value and the expected value, each weighted by the probability of that value. The standard deviation is the square root of the variance, and measures the spread in the same units as the variable itself.

* Correlation is a measure of the relationship between two random variables. Positive correlation means that the two variables tend to move in the same direction (e.g. as one variable increases, the other variable also increases), while negative correlation means that the two variables tend to move in opposite directions (e.g. as one variable increases, the other variable decreases).

* Hypothesis testing is a statistical procedure used to assess whether a claim or hypothesis about a population parameter (such as the mean or variance) is consistent with a sample of data drawn from the population. Hypothesis testing involves formulating a null hypothesis (the default claim to be tested) and an alternative hypothesis (the claim that the null hypothesis is false), and determining the probability of obtaining data at least as extreme as the observed sample if the null hypothesis is true.
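
The definitions of expected value and variance above can be checked on a small example. The sketch below assumes a fair coin flipped twice and takes X to be the number of heads; the function names are illustrative and do not come from any library.

```python
from fractions import Fraction
from math import sqrt

# Probability mass function of X = number of heads in two fair coin flips.
# The fair-coin assumption is ours; any dict of value -> probability works.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expected_value(pmf):
    """E[X]: sum of x * P(X = x) over all possible values x."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X): sum of (x - E[X])^2 * P(X = x) over all possible values x."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

mean = expected_value(pmf)   # exact rational arithmetic
var = variance(pmf)
std = sqrt(var)              # standard deviation as a float
print(mean, var, std)
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to compare the result with a hand calculation.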


## Numerical vs Categorical Data

Numerical data and categorical data are two main types of data that we worked with during our course.

Numerical data can be measured or quantified, and is represented by numbers, or in general by vectors. Such data can be sampled from a continuous random variable or a discrete random variable. Continuous numerical data can take on any value within a certain range (e.g. height, weight), while discrete numerical data can only take on specific, distinct values (e.g. the number of students in a class).

Categorical data refers to data that can be organized into categories or groups. Categorical data is often represented by text or symbols, and arithmetic on the categories is not meaningful. Categorical data can be either nominal or ordinal. Nominal categorical data consists of categories that do not have a natural order (e.g. eye color, hair color), while ordinal categorical data consists of categories that have a natural order (e.g. low, medium, high).

The differences between numerical and categorical data can have important implications for data analysis tasks. For example:

* We use different summary statistics for numerical and categorical data. For numerical data, common summary statistics include mean, median, and standard deviation, while for categorical data, common summary statistics include frequency and percentage.

* We use different types of plots for numerical and categorical data. For numerical data, common plots include histograms, scatterplots, and boxplots, while for categorical data, common plots include bar plots and pie charts.

* We use different techniques for cleaning and preparing numerical and categorical data. For example, missing values in numerical data can often be imputed using techniques such as mean imputation or multiple imputation, while missing values in categorical data may need to be handled differently (e.g. by replacing missing values with a separate category).

* We use different models for numerical and categorical data. For example, linear regression is commonly used when the response variable is numerical, while logistic regression is commonly used when the response variable is categorical.
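
The contrast in summary statistics and missing-value handling can be sketched with the standard library alone. The sample values below are hypothetical, and in the course the same steps would normally be done with pandas rather than by hand.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical sample: heights in cm (numerical) and eye colors (categorical).
# None marks a missing value.
heights = [160.0, 172.5, None, 181.0, 168.5]
eye_colors = ["brown", "blue", None, "brown", "green"]

# Numerical column: mean imputation, then mean / median / standard deviation.
observed = [h for h in heights if h is not None]
fill = mean(observed)
heights_filled = [fill if h is None else h for h in heights]
numeric_summary = {
    "mean": mean(heights_filled),
    "median": median(heights_filled),
    "stdev": stdev(heights_filled),
}

# Categorical column: missing values become their own category,
# then we report frequencies and percentages.
colors_filled = ["missing" if c is None else c for c in eye_colors]
counts = Counter(colors_filled)
percentages = {k: 100 * v / len(colors_filled) for k, v in counts.items()}

print(numeric_summary)
print(counts, percentages)
```

Note that mean imputation leaves the mean unchanged but shrinks the standard deviation, which is one reason more careful methods such as multiple imputation exist.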

## Structured Data

Structured data is data that is organized in a well-defined format, such as a table or a spreadsheet. It typically consists of rows and columns, with each column representing a different attribute or feature of the data, and each row representing a data point or record.

There are many different types of structured data, depending on the specific characteristics and purpose of the data. Here are some examples of different structured data types:

* Tabular data: Tabular data is data that is organized into a table, with rows representing data points and columns representing attributes or features. Examples of tabular data include data stored in spreadsheet formats (such as CSV or Excel), data stored in a database table, and data organized as a dataframe in a programming language such as Python.

* Hierarchical data: Hierarchical data is data that is organized into a tree-like structure, with each data point having one or more child points. Hierarchical data is often used to represent relationships between data points, such as the parent-child relationship between categories in a product catalog.

* Graph data: Graph data is data that is organized into a graph structure, with data points representing nodes and relationships between data points represented as edges. Graph data is often used to represent complex relationships between data points, such as social networks or networks of interconnected data points.
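
As a small illustration of hierarchical data, the sketch below stores a hypothetical product catalog as nested dictionaries and walks the tree recursively. The `name` and `children` keys are an assumed layout, not a standard.

```python
# Hypothetical product catalog stored as a tree of nested dictionaries.
catalog = {
    "name": "catalog",
    "children": [
        {"name": "electronics", "children": [
            {"name": "phones", "children": []},
            {"name": "laptops", "children": []},
        ]},
        {"name": "books", "children": []},
    ],
}

def leaf_names(node):
    """Collect the names of all nodes without children, depth first."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names

def depth(node):
    """Number of levels in the tree rooted at node."""
    return 1 + max((depth(child) for child in node["children"]), default=0)

leaves = leaf_names(catalog)
levels = depth(catalog)
print(leaves, levels)
```

A structure like this does not fit naturally into a single table, which is why formats such as JSON or XML are usually preferred for it.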

## Structured Data Types

### Tabular/Columnar Data

Columnar data is data that is organized into a table-like structure, with rows representing individual observations and columns representing variables or features. Columnar data is a common format for storing and manipulating data in databases, spreadsheets, and other data management systems.

One of the main advantages of columnar data is that it allows for efficient querying and manipulation of data using SQL or other query languages. Columnar data is also well-suited for storing and manipulating large datasets, as it allows for fast access to specific columns or subsets of the data without needing to read the entire dataset into memory.

In addition to these benefits, columnar data also has some limitations. One potential drawback of columnar data is that it can be difficult to represent complex relationships between variables or to store data with irregular or nested structure. In such cases, other data formats, such as hierarchical data or graph data, may be more suitable.

![Land_Slides](./images/land-slides.png)

(Image: A sample of landslides that happened in Turkey. Data is from USGS.)

During the course we relied heavily on [UCI's data repository](https://archive.ics.uci.edu/ml/datasets.php) for example datasets, and on [Python](https://www.python.org/) with its tools and libraries for analyzing columnar data. The library we used most was [Pandas](https://pandas.pydata.org/).

### Time Series Data

Time series data is a type of structured data, as it is typically organized into a table or dataframe with rows representing time points and columns representing different variables or attributes. A time series is a sequence of data points that are collected over time. The data is often collected at regular intervals, such as every hour, day, week, or month. Time series data can come from a wide range of sources, such as financial data, sensor data, social media data, and more.

![A_Time_Series](./images/time-series.png)

(Image: Paleohydrologic reconstructions of water-year streamflow for 31 stream gaging sites in the Missouri River Basin with complete data for 1685 through 1977.)

We can analyze time series in different ways, depending on the specific needs of the analysis and the characteristics of the data. Here are some common techniques for analyzing time series data:

* Forecasting: Time series forecasting is the process of predicting future values of a time series based on its past values. Forecasting can be useful for predicting future demand, sales, or other quantities of interest, and can be done using a variety of techniques, such as exponential smoothing, autoregressive integrated moving average (ARIMA) models, or machine learning algorithms.

* Decomposition: Time series decomposition is the process of breaking down a time series into its component parts, such as trend, seasonality, and noise. Decomposition can be useful for understanding the underlying patterns and trends in the data, and can be used as a preprocessing step for other types of analysis.

* Anomaly detection: Time series anomaly detection is the process of identifying unusual or unexpected patterns in a time series. Anomaly detection can be used to identify problems or issues in a system, or to detect fraudulent or unusual behavior in financial or other types of data.

* Time series visualization: Time series visualization is the process of creating graphs or plots to visualize the trends and patterns in a time series. Visualization can be useful for identifying trends, spotting anomalies, and communicating the results of time series analysis to others.
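
Two of the simplest building blocks behind the techniques above are a moving average (a rough trend estimate used in decomposition) and simple exponential smoothing (a basic forecasting method). The sketch below implements both in plain Python on hypothetical values; it is not the Statsmodels implementation, and the function names are ours.

```python
def moving_average(series, window):
    """Trailing moving average; returns one value per full window."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical measurements taken at regular intervals.
values = [10.0, 12.0, 14.0, 13.0, 15.0, 17.0]

trend = moving_average(values, window=3)
smoothed = exponential_smoothing(values, alpha=0.5)

# Under simple exponential smoothing, the one-step-ahead forecast
# is the last smoothed value.
forecast = smoothed[-1]
print(trend, forecast)
```

A larger `alpha` makes the smoothed series react faster to recent observations, while a smaller one gives more weight to history.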

Python has many tools and libraries available for analyzing time series data. During this course we used [Pandas](https://pandas.pydata.org/), [scikit-learn](https://scikit-learn.org/stable/), [Statsmodels](https://www.statsmodels.org/stable/index.html), and many other libraries. 

### Network/Graph Data

Network data is data that represents relationships or connections between entities or nodes. Network data is typically represented as a graph, with nodes representing the entities and edges representing the relationships between them. There are many different types of data that can be considered as network data, depending on the specific characteristics and purpose of the data. One can analyze such data using techniques such as network centrality, community detection, shortest path analysis, vulnerability analysis, or information diffusion. 
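
To make two of these techniques concrete, the sketch below computes degree centrality and a shortest path by breadth-first search on a small hypothetical network. It uses a plain adjacency list rather than a graph library, so the function names are ours and only mirror what such libraries typically offer.

```python
from collections import deque

# Hypothetical undirected network stored as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B", "E"],
    "E": ["D"],
}

def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}

def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path as a list, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

centrality = degree_centrality(graph)
path = shortest_path(graph, "A", "E")
print(centrality, path)
```

Breadth-first search finds shortest paths only when all edges count equally; weighted networks need an algorithm such as Dijkstra's.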

Here are some examples of different types of network data:

* Social network data
* Communication network data
* Transportation network data 
* Infrastructure network data

Social network data is data that represents relationships between individuals or groups in a social context. Social network data can be used to study social interactions, influence, and community structure.

![GOT](./images/got-characters.png)

(Image: A network plot of Game of Thrones characters.)

Communication network data is data that represents relationships between individuals or groups based on communication patterns. Communication network data can be used to study information flow, social ties, and collaboration.

Transportation network data is data that represents relationships between locations or nodes in a transportation network. Transportation network data can be used to study the connectivity and accessibility of different locations.

Infrastructure network data is data that represents relationships between infrastructure assets or nodes in a network. Infrastructure network data can be used to study the interdependencies and vulnerabilities of different assets.

The main library we used for network analysis was [networkx](https://networkx.org/). NetworkX is a library for the creation, manipulation, and study of complex networks. It provides a wide range of functions for tasks such as network creation, network visualization, network analysis, and more.  

Here are some alternatives to NetworkX:

* [igraph](https://igraph.org/) is a library for the creation and analysis of complex networks. It provides a wide range of functions for tasks such as network creation, network visualization, network analysis, and more.

* [graph-tool](https://graph-tool.skewed.de/) is a library for the creation, manipulation, and analysis of complex networks. It provides a wide range of functions for tasks such as network creation, network visualization, network analysis, and more, and is optimized for performance.

* [snap-stanford](https://pypi.org/project/snap-stanford/) is a library for working with large networks that provides a wide range of functions for tasks such as network creation, network visualization, network analysis, and more. It is built on top of the [Stanford Network Analysis Platform (SNAP)](http://snap.stanford.edu/), and is designed to be efficient and scalable.

* [PyGSP](https://pygsp.readthedocs.io/en/stable/) is a library for graph signal processing that provides a wide range of functions for tasks such as graph filtering, graph convolution, and graph spectral analysis. It is built on top of NumPy and SciPy, and is designed to make it easy to work with graphs and signals in Python.


### Common Structured Data Formats

Here are some data formats we used during the class to work with structured data:

1. [Comma-Separated Values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values)
2. [JavaScript Object Notation (JSON)](https://en.wikipedia.org/wiki/JSON)
3. [YAML Ain't Markup Language (YAML)](https://en.wikipedia.org/wiki/YAML)
4. [Extensible Markup Language (XML)](https://en.wikipedia.org/wiki/XML)
5. [Microsoft Excel Files (XLS,XLSX)](https://docs.fileformat.com/spreadsheet/xls/)
6. [MATLAB Binary Data Format (MAT)](https://www.loc.gov/preservation/digital/formats/fdd/fdd000440.shtml)
7. [Apache Parquet Files](https://arrow.apache.org/docs/python/parquet.html)
8. [Hierarchical Data Format (HDF)](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)

![JSON_Example](./images/json-sample.png)

(Image: A JSON Example.)
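
As a quick illustration of how two of these formats relate, the sketch below reads a small CSV document and writes the same records as JSON using only the standard library. The column names and values are hypothetical, and the CSV text is held in a string so the example is self-contained.

```python
import csv
import io
import json

# Hypothetical CSV content; in practice this would come from a file.
csv_text = """name,height_cm,eye_color
Ada,168.5,brown
Grace,172.0,blue
"""

# csv.DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV carries no type information, so numeric columns must be converted.
for row in rows:
    row["height_cm"] = float(row["height_cm"])

# JSON keeps the distinction between numbers and strings.
json_text = json.dumps(rows, indent=2)
round_trip = json.loads(json_text)
print(json_text)
```

The explicit `float` conversion is the main point: every CSV field is read as a string, while JSON preserves basic types.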

### Structured Data Processing Libraries in Python

The main library we used for working with structured data was [pandas](https://pandas.pydata.org/). Pandas is a widely used library for data manipulation and analysis in Python. It provides easy-to-use data structures and data analysis tools for handling and manipulating large datasets. Pandas is particularly well-suited for working with tabular data, such as data stored in spreadsheet-like formats (e.g. CSV, Excel).

Here are some alternatives to pandas:

* [Dask](https://docs.dask.org/en/stable/) is a flexible parallel computing library that is built on top of NumPy and pandas. It allows you to scale your data processing and analysis workloads across multiple CPU cores or even distributed across a cluster of machines. Dask is particularly useful for working with large datasets that do not fit into memory, as it can handle data that is stored in external storage (e.g. on disk, in a database) and perform lazy evaluation to avoid reading all data into memory at once.

* [DuckDB](https://duckdb.org/2021/05/14/sql-on-pandas.html) is a column-oriented database management system (DBMS) that is designed to be fast and efficient for analytical queries and data processing tasks. It is written in C++ and has a Python API that allows you to use it as a library for data analysis in Python. DuckDB can handle large datasets and is optimized for in-memory and on-disk processing. It supports SQL and can be used to perform complex data transformations and aggregations.

* [SQLite3](https://docs.python.org/3/library/sqlite3.html) is a lightweight, self-contained database management system (DBMS) that is widely used for data storage and management in many applications. It is written in C and has a simple, easy-to-use API that can be accessed from a variety of programming languages, including Python. SQLite3 can be used to store and manage large datasets, and to perform complex data transformations and aggregations using SQL commands. It can be particularly useful for handling data that is structured in a tabular format, and is often used as a lightweight alternative to more full-featured database management systems, as it does not require a separate server process and can be easily embedded in applications. It is also well-suited for use in environments where data needs to be stored locally (e.g. on a device or in a web browser).

* [Vaex](https://vaex.io/) is a high-performance data analysis library for Python that is designed for working with very large datasets (tens of billions of rows). It uses a lazy evaluation model and an optimized in-memory data representation to allow for fast computation and exploration of large datasets without the need to load all data into memory. Vaex supports a wide range of data formats and can be used for tasks such as data filtering, aggregation, and visualization.
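
Of the alternatives above, SQLite3 ships with Python, so it is easy to try. The sketch below builds a small in-memory table and runs a grouped aggregation, roughly the SQL counterpart of a pandas `groupby`. The table name, columns, and values are hypothetical.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 3), ("north", 1), ("south", 0), ("south", 5), ("south", 2)],
)

# Count rows and sum amounts per region.
totals = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(totals)
```

The `?` placeholders let the driver handle quoting, which is safer than building SQL strings by hand.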


### Alternatives to Python

* [R](https://www.r-project.org/) is a programming language and environment specifically designed for statistical computing and data analysis. It has a large and active user community, and is supported by a wide range of libraries and tools for tasks such as data manipulation, visualization, machine learning, and more.

![R_Sample](./images/r-sample.png)

* [Julia](https://julialang.org/) is a programming language specifically designed for scientific computing and data analysis. It has a strong focus on performance and efficiency, and is supported by a wide range of libraries and tools for tasks such as numerical optimization, machine learning, and data visualization.

![Julia_Sample](./images/julia-sample.png)

* [MATLAB](https://www.mathworks.com/products/matlab.html) is a proprietary programming language and environment specifically designed for scientific computing and engineering. It has a wide range of built-in functions and toolboxes for tasks such as data manipulation, visualization, machine learning, and more, and is often used in academia and industry. MATLAB has an open alternative called [Octave](https://octave.org/).

![Matlab_Sample](./images/matlab-sample.png)

* [Scala](https://www.scala-lang.org/) is a programming language that combines elements of functional programming and object-oriented programming, and is often used in data engineering and big data applications. The main selling point of Scala is [Apache Spark](https://spark.apache.org/), a unified engine for large-scale data analytics. Spark has a strong focus on performance and scalability, and is designed specifically for tasks such as data manipulation, machine learning, and data streaming.

![Scala_Sample](./images/scala-sample.png)


## Semi-Structured Data

### Geographic/Spatial Data

Geographic/Spatial data falls in between structured and unstructured data.  Spatial data is data that represents geographic locations or objects, and is often used in geospatial analysis. Geospatial analysis is the process of analyzing and interpreting data that has a spatial component, and can be used to study patterns, trends, and relationships in data that is geographically distributed.

Examples of spatial data include:

* Maps
* Location data    
* Satellite imagery

Maps are representations of the Earth's surface or a portion of it, and can be used to show geographic features such as coastlines, rivers, mountains, and cities. Maps can be created in various scales and projections, and can be used to study patterns and trends in data such as population density, land use, or transportation networks.

![Besiktas](./images/besiktas.png)

(Image: A map of Beşiktaş.)

Location data is data that represents the location of objects or devices. It can be collected using various technologies such as GPS, WiFi, or Bluetooth, and can be used to study movement and mobility, for example transportation patterns, consumer behavior, and social interactions.

Satellite imagery is data captured by satellite sensors, and can be used to study the Earth's surface and atmosphere in detail. Satellite imagery can be used to study patterns and trends in data such as land use, vegetation, and land cover.

### A List of GIS Formats and Geospatial File Extensions 

* ESRI Shapefiles (.SHP, .DBF, .SHX): ESRI shapefile is a standard geospatial file format. The
following three file types are mandatory for a shapefile: DBF (attribute data), SHP (feature
geometry) and SHX (shape index positions). The following files are optional: PRJ (the projection
system), SBN (spatial index to optimize queries), and SBX (to optimize file uptake).

* Geographic JSON (.GEOJSON, .JSON): GeoJSON is a human-readable data format that encodes data in a
variant of JSON which usually has less markup overhead compared to other markup languages such as GML.

* Geography Markup Language (.GML): GML is another human-readable data format. It is a variant of
XML, and is very verbose compared to GeoJSON.

* Open Street Map Files (.OSM): OSM is another XML-based file format used by the largest
crowdsourcing GIS data project in the world.  OSM also uses a smaller alternate format PBF
(Protocolbuffer Binary Format), a binary format based on Protocol Buffers rather than XML.

* Google Keyhole Markup Language (.KML, .KMZ): This is another XML variant used primarily for
Google Earth. There is also a compressed version (KMZ).

* GPS exchange format (.GPX): This too is a variant of XML that encodes data captured by GPS
receivers, and primarily used for data exchange between GPS software.  GPX records latitude and
longitude coordinates, location, time, and elevation.

* US Census Bureau file formats: Digital Line Graph (DLG) and Geographic Base File-Dual Independent
Map Encoding (GBF-DIME). DLG files encode information for topographic maps such as contour lines,
roads, railroads, towns, and township lines. Much of the U.S. Bureau of Census Topologically
Integrated Geographic Encoding and Referencing (TIGER) data is stored in DLG format. GBF-DIME format
is used mainly for encoding US road network in major urban areas.

* ASCII Grid Files (.ASC): ASC files are plain-text raster files consisting of a short header followed by rows of whitespace-delimited cell values.

* GeoTIFF Files (.TIFF, .OVR): GeoTIFF is a variant of the TIFF format which is originally an image
format. Usually GeoTIFF files come with other files: XML (for metadata), TFW (for raster location),
AUX (projection information), OVR (to improve raster display).

* ERDAS Image Files (.IMG): IMG format is a hierarchical data format that can store hyperspectral
images along with information about ground control points, sensors, or projections.

* ESRI Grid Files (.ADF): ADF files encode data in a grid. The grid can be an integer grid (such as
land cover) or a floating point grid (such as elevation). Attribute data is usually stored alongside
in a separate file.

* ENVI Raster Files (.BIL, .BIP, .BSQ): Band Interleaved data files come in 3 varieties. These
varieties encode hyperspectral images by lines, by pixel, and by sequence. They usually come with a
separate header file (HDR) that contains some metadata for the images such as size, depth, layout
etc.
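
Because GeoJSON is ordinary JSON, it can be inspected without any GIS library. The sketch below parses a small hand-written `FeatureCollection` and pulls out the point features. The coordinates and names are illustrative only, and real projects would more likely use a dedicated geospatial library.

```python
import json

# A hand-written GeoJSON FeatureCollection; coordinates are illustrative only.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [29.0, 41.04]},
      "properties": {"name": "sample point"}
    },
    {
      "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [[29.0, 41.0], [29.1, 41.1]]
      },
      "properties": {"name": "sample line"}
    }
  ]
}
"""

collection = json.loads(geojson_text)

# GeoJSON positions are ordered [longitude, latitude].
points = [
    (f["properties"]["name"], tuple(f["geometry"]["coordinates"]))
    for f in collection["features"]
    if f["geometry"]["type"] == "Point"
]
geometry_types = [f["geometry"]["type"] for f in collection["features"]]
print(points, geometry_types)
```

The longitude-first ordering is a common source of bugs, since many mapping tools expect latitude first.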



## Unstructured Data

In this course we worked with the following main classes of unstructured data:

1. Text data
2. Image data
3. Audio data

### Text Data

Text data refers to data that is in the form of written or spoken language. It can include text documents, social media posts, transcripts of audio or video recordings, and any other type of data that consists of words and sentences.

![Word_Cloud](./images/word-cloud.png)

There are many different ways to process and analyze text data, depending on the specific goals of the analysis and the characteristics of the data. Here are some common approaches:

* Text classification is the process of assigning a label or category to a piece of text based on its content. Text classification can be used for tasks such as spam filtering, sentiment analysis, or topic identification.

* Summarization refers to the process of generating a concise and coherent summary of a text or a collection of texts. Summarization can be performed at various levels of granularity, ranging from a summary of a single sentence to a summary of an entire document or a corpus of documents. Summarization is an important task in natural language processing, as it allows people to quickly understand the main points or ideas expressed in a text. It is also useful for tasks such as information retrieval, where a summary of a document can help users understand the content of the document without having to read the entire document.

* Sentiment analysis is about determining the emotional tone or sentiment of a piece of text. It can be used to classify text as positive, negative, or neutral, and can be useful for understanding how people feel about a particular topic or product.

* Keyword extraction is the process of automatically identifying and extracting the most relevant and important words or phrases from a text. Keywords are typically used to summarize the content of a text, to facilitate search and retrieval, or to support tasks such as text classification or topic modeling.

* Named entity recognition (NER) algorithms identify and classify named entities (e.g. people, organizations, locations) in a piece of text. NER can be useful for extracting structured information from unstructured text and can be used for tasks such as information extraction or text summarization.

* Author attribution is the process of determining the identity of the author of a text based on their writing style or other linguistic characteristics. It is often used in tasks such as plagiarism detection, forensic linguistics, and literary analysis, and can be based on various types of linguistic features, including vocabulary, syntax, and structure. There are many approaches to author attribution, including rule-based systems, machine learning-based systems, and hybrid systems. Machine learning-based systems are often based on techniques such as clustering, classification, and feature selection, and have achieved good performance on a wide range of tasks.

* Part-of-speech (POS) taggers are algorithmic devices that assign a grammatical category (e.g. noun, verb, adjective) to each token in a piece of text. POS tagging can be useful for understanding the structure and meaning of the text, and can be used as a preprocessing step for other tasks such as named entity recognition or sentiment analysis.
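
As a rough illustration of keyword extraction, the sketch below tokenizes a text, drops a few stop words, and ranks the remaining words by frequency. This is a deliberately naive baseline with a tiny hand-picked stop-word list, not the method used by any particular NLP library.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list; real libraries ship far larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "for", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_keywords(text, k=3):
    """Return the k most frequent non-stop-word tokens with their counts."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(tokens).most_common(k)

sample = (
    "Text data is data in the form of language. "
    "Text classification assigns a label to text."
)
keywords = top_keywords(sample, k=2)
print(keywords)
```

Raw frequency favors common words, so practical systems usually weight terms against a larger corpus, for example with TF-IDF.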

There are many tools and libraries available for processing and analyzing text data in Python. Some popular ones include [Natural Language Toolkit (NLTK)](https://www.nltk.org/), [spaCy](https://spacy.io/), [Gensim](https://radimrehurek.com/gensim/), and [Classical Language Toolkit (CLTK)](http://cltk.org/). These libraries provide a wide range of functions for tasks such as tokenization, POS tagging, NER, sentiment analysis, and text classification.

### Image Data

Image data is data that represents visual information, such as photographs, videos, or other types of images. Image data can be either raster or vector.

Raster image data is data that is represented as a grid of pixels, with each pixel representing a different color or intensity value. Raster images are often stored in formats such as JPEG, PNG, or TIFF, and are well-suited for representing continuous-tone images such as photographs.

Vector image data is data that is represented as a set of points, lines, and curves, with each element defined mathematically. Vector images are often stored in formats such as SVG, and are well-suited for representing graphics and text.


| Abbreviation | File format                           | File extension(s)                | Summary             |  
|--------------|---------------------------------------|:--------------------------------:|:--------------------|
| GIF          | Graphics Interchange Format           | .gif                             | Good choice for simple images and animations. Prefer PNG for lossless and indexed still images, and consider WebP, AVIF or APNG for animation sequences.                                                                         |
| JPEG         | Joint Photographic Expert Group image | .jpg, .jpeg, .jfif, .pjpeg, .pjp | Good choice for lossy compression of still images (currently the most popular). Prefer PNG when more precise reproduction of the image is required. |
| PNG          | Portable Network Graphics             | .png                             | PNG is preferred over JPEG for more precise reproduction of source images, or when transparency is needed. |
| SVG          | Scalable Vector Graphics              | .svg                             | Vector image format; ideal for user interface elements, icons, diagrams, etc., that must be drawn accurately at different sizes.   |
| BMP          | Bitmap file                           | .bmp                           |  The BMP (Bitmap image) file type is most prevalent on Windows computers, and is generally used only for special cases in web apps and content. |
| TIFF         | Tagged Image File Format              | .tif, .tiff                    |TIFF is a raster graphics file format which was created to store scanned photos, although it can be any kind of image. |

The table above is taken from [here](https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Image_types#common_image_file_types) with some modifications.

Image data is often used in a wide range of applications, such as computer vision, image processing, and machine learning. Some common types of analyses that can be performed on image data include:

* Image classification: Image classification is the process of assigning a label or category to an image based on its content. Image classification can be used for tasks such as object detection or scene understanding, and can be performed using machine learning algorithms or other techniques.

* Object detection: Object detection is the process of identifying and locating objects in an image. Object detection can be used for tasks such as autonomous driving or image tagging, and can be performed using machine learning algorithms or other techniques.

* Image segmentation: Image segmentation is the process of dividing an image into regions or segments based on certain criteria, such as color, texture, or shape. Image segmentation can be used for tasks such as object recognition or image manipulation, and can be performed using machine learning algorithms or other techniques.

* Optical character recognition (OCR): OCR is the process of locating and extracting text from an image, such as a scanned document. OCR systems typically work by analyzing the image of a document and recognizing the individual characters in the text using pattern recognition algorithms and machine learning models. The recognized text is then output in a machine-readable format, such as ASCII or Unicode text, which can be further processed or analyzed.

* Image restoration: Image restoration is the process of repairing or enhancing an image that has been damaged or degraded in some way. Image restoration can be used to remove noise, blur, or other distortions from an image, and can be performed using techniques such as filtering, interpolation, or image inpainting.
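
The simplest form of image segmentation is thresholding: every pixel brighter than a cutoff is labeled foreground. The sketch below applies it to a hypothetical 4x4 grayscale raster stored as nested lists. In practice one would use OpenCV or scikit-image on NumPy arrays, and the cutoff of 128 is an arbitrary choice.

```python
# A hypothetical 4x4 grayscale raster: each entry is a pixel intensity in 0..255.
image = [
    [10, 12, 200, 210],
    [11, 13, 205, 220],
    [9, 180, 190, 215],
    [8, 10, 12, 230],
]

def threshold_segment(image, threshold):
    """Label each pixel 1 (foreground) if its intensity exceeds the threshold, else 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]

mask = threshold_segment(image, threshold=128)

# Count how many pixels were assigned to the foreground segment.
foreground_pixels = sum(sum(row) for row in mask)

for row in mask:
    print(row)
print(foreground_pixels)
```

The resulting mask has the same shape as the image, which is the usual convention for segmentation output.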

There are many libraries available for image processing tasks in Python. Some popular ones include:

1. [OpenCV (Open Source Computer Vision Library)](https://docs.opencv.org/4.x/d6/d00/tutorial_py_root.html) is a widely-used library for image processing and computer vision tasks. It provides a wide range of functions for tasks such as image acquisition, image filtering, image transformation, object detection, and more.

2. [Pillow](https://pillow.readthedocs.io/en/stable/) is a fork of the Python Imaging Library (PIL), and is a widely-used library for working with image data in Python. It provides functions for tasks such as reading and writing image files, manipulating images, and applying image filters.

3. [Scikit-image](https://scikit-image.org/) is a library for image processing and computer vision tasks that is built on top of NumPy and SciPy. It provides a wide range of functions for tasks such as image filtering, image segmentation, image feature extraction, and more.

4. [Imutils](https://github.com/PyImageSearch/imutils) is a library that provides a set of convenience functions for tasks such as image resizing, image translation, and image rotation. It is built on top of OpenCV and is designed to make common image processing tasks easy to perform.

## Image Data for Teaching, Experiments and Research

![Olivetti_Faces](./images/olivetti.png)

(Image: A sample from Olivetti Faces Dataset.)

1. [MNIST](http://yann.lecun.com/exdb/mnist/)
2. [Extended MNIST](https://www.kaggle.com/datasets/crawford/emnist)
3. [Fashion MNIST](https://www.kaggle.com/datasets/zalando-research/fashionmnist)
4. [Kuzushiji-MNIST](https://github.com/rois-codh/kmnist)
5. [ImageNet](https://image-net.org/update-mar-11-2021.php)
6. [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html)
7. [Olivetti Faces Dataset](https://scikit-learn.org/0.19/datasets/olivetti_faces.html)
8. [Labeled Faces in the Wild Dataset](http://vis-www.cs.umass.edu/lfw/)
9. [Large-scale CelebFaces Attributes (CelebA) Dataset](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)
10. [Face Recognition Technology (FERET)](https://www.nist.gov/programs-projects/face-recognition-technology-feret)
11. [iMaterialist Competition - Fashion](https://github.com/visipedia/imat_comp)
12. [DeepFashion2 Dataset](https://github.com/switchablenorms/DeepFashion2)
13. [102 Category Flower Dataset](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html)
14. [Comprehensive Plant Image Dataset](https://www.quantitative-plant.org/dataset)
15. [Caltech-UCSD Birds Dataset](https://vision.cornell.edu/se3/caltech-ucsd-birds-200/)
16. [The Oxford-IIIT Pet Image Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/)
17. [Stanford Dog Images Dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/)
18. [Fishnet Open Images Dataset](https://www.fishnet.ai/download)
19. [LEGO Bricks Image Dataset](https://www.kaggle.com/datasets/joosthazelzet/lego-brick-images)
20. [The Comprehensive Cars (CompCars) dataset](http://mmlab.ie.cuhk.edu.hk/datasets/comp_cars/index.html)
21. [Stanford Car Images Dataset](http://ai.stanford.edu/~jkrause/cars/car_dataset.html)
22. [LabelMe Dataset](http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php)

![MNIST](./images/mnist-sample.png)


### Audio Data

Audio data represents sound, such as music, speech, or environmental noise. It is typically stored in a digital format, such as MP3, WAV, or AIFF, and can be played back using audio software or hardware.

#### A list of audio file formats

<p>(Source: <a
href="https://en.wikipedia.org/wiki/Audio_file_format">Wikipedia</a>)</p>
<table>
<thead>
<tr class="header">
<th>Extension</th>
<th style="text-align: left;">Explanation</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>.aac</code></td>
<td style="text-align: left;">The Advanced Audio Coding format is based
on the MPEG-2 and MPEG-4 standards. AAC files are usually ADTS or ADIF
containers.</td>
</tr>
<tr class="even">
<td><code>.aiff</code></td>
<td style="text-align: left;">A standard uncompressed, CD-quality audio
file format used by Apple. Established 3 years prior to Microsoft’s
uncompressed format, wav.</td>
</tr>
<tr class="odd">
<td><code>.au</code></td>
<td style="text-align: left;">The standard audio file format used by
Sun, Unix and Java. The audio in au files can be PCM or compressed with
the μ-law, a-law or G729 codecs.</td>
</tr>
<tr class="even">
<td><code>.flac</code></td>
<td style="text-align: left;">A file format for the Free Lossless Audio
Codec, an open-source lossless compression codec.</td>
</tr>
<tr class="odd">
<td><code>.m4a</code></td>
<td style="text-align: left;">An audio-only MPEG-4 file, used by Apple
for unprotected music downloaded from their iTunes Music Store. Audio
within the m4a file is typically encoded with AAC, although lossless
ALAC may also be used.</td>
</tr>
<tr class="even">
<td><code>.m4b</code></td>
<td style="text-align: left;">An Audiobook / podcast extension with AAC
or ALAC encoded audio in an MPEG-4 container. Both M4A and M4B formats
can contain metadata including chapter markers, images, and hyperlinks,
but M4B allows “bookmarks” (remembering the last listening spot),
whereas M4A does not.</td>
</tr>
<tr class="odd">
<td><code>.m4p</code></td>
<td style="text-align: left;">A version of AAC with proprietary Digital
Rights Management developed by Apple for use in music downloaded from
their iTunes Music Store and their music streaming service known as
Apple Music.</td>
</tr>
<tr class="odd">
<td><code>.mmf</code></td>
<td style="text-align: left;">A Samsung audio format that is used in
ringtones. It was developed by Yamaha as SMAF (“Synthetic music Mobile
Application Format”), a multimedia data format that uses the .mmf file
extension.</td>
</tr>
<tr class="even">
<td><code>.mp3</code></td>
<td style="text-align: left;">MPEG Layer III Audio. It is the most
common sound file format used today.</td>
</tr>
<tr class="odd">
<td><code>.ogg</code>, <code>.oga</code>, <code>.mogg</code></td>
<td style="text-align: left;">A free, open source container format
supporting a variety of formats, the most popular of which is the audio
format Vorbis. Vorbis offers compression similar to MP3 but is less
popular. Mogg, the “Multi-Track-Single-Logical-Stream Ogg-Vorbis”, is
the multi-channel or multi-track Ogg file format.</td>
</tr>
<tr class="even">
<td><code>.opus</code></td>
<td style="text-align: left;">A lossy audio compression format developed
by the Internet Engineering Task Force (IETF) and made especially
suitable for interactive real-time applications over the Internet. As an
open format standardised through RFC 6716, a reference implementation is
provided under the 3-clause BSD license.</td>
</tr>
<tr class="odd">
<td><code>.raw</code></td>
<td style="text-align: left;">A raw file can contain audio in any format
but is usually used with PCM audio data. It is rarely used except for
technical tests.</td>
</tr>
<tr class="even">
<td><code>.wav</code></td>
<td style="text-align: left;">Standard audio file container format used
mainly in Windows PCs. Commonly used for storing uncompressed (PCM),
CD-quality sound files, which means that they can be large in
size—around 10 MB per minute. Wave files can also contain data encoded
with a variety of (lossy) codecs to reduce the file size (for example
the GSM or MP3 formats). Wav files use a RIFF structure.</td>
</tr>
<tr class="odd">
<td><code>.wma</code></td>
<td style="text-align: left;">Windows Media Audio format, created by
Microsoft. Designed with Digital Rights Management (DRM) abilities for
copy protection.</td>
</tr>
<tr class="even">
<td><code>.webm</code></td>
<td style="text-align: left;">Royalty-free format created for HTML5
video.</td>
</tr>
</tbody>
</table>
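To make the `.wav` entry above concrete, here is a small sketch that writes uncompressed 16-bit PCM audio into a WAV container and reads it back, using Python's standard `wave` module. The tone frequency, duration, and in-memory buffer are illustrative assumptions, and the last lines check the "around 10 MB per minute" figure for CD-quality stereo audio.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100   # samples per second (CD quality)
SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16-bit PCM
DURATION = 0.5        # seconds (illustrative)
FREQUENCY = 440.0     # Hz (illustrative tone)

# Generate a sine wave at half of the maximum 16-bit amplitude.
n_samples = int(SAMPLE_RATE * DURATION)
samples = [
    int(0.5 * 32767 * math.sin(2 * math.pi * FREQUENCY * n / SAMPLE_RATE))
    for n in range(n_samples)
]

# Write the samples as a mono WAV file into an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(SAMPLE_WIDTH)
    wav_out.setframerate(SAMPLE_RATE)
    wav_out.writeframes(struct.pack(f"<{n_samples}h", *samples))

# Read the file back and decode the little-endian 16-bit samples.
buffer.seek(0)
with wave.open(buffer, "rb") as wav_in:
    frame_rate = wav_in.getframerate()
    n_frames = wav_in.getnframes()
    frames = wav_in.readframes(n_frames)
decoded = list(struct.unpack(f"<{n_frames}h", frames))

print(frame_rate, n_frames, n_frames / frame_rate)

# Size of one minute of CD-quality stereo audio, in bytes.
bytes_per_minute = SAMPLE_RATE * SAMPLE_WIDTH * 2 * 60
print(bytes_per_minute / 1_000_000, "MB per minute")
```

Passing a file name instead of the `io.BytesIO` buffer to `wave.open` would write the same data to disk.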

Audio data is similar to other types of time series data in that it is a sequence of data points collected over time. However, there are some characteristics of audio data that make it different from other types of time series data:

![Wavelet_Transform](./images/spectrogram.png)

(Image: [Source](https://www.researchgate.net/figure/The-result-of-the-continuous-wavelet-transform-Bluish-colors-represent-low-energy_fig6_258623566))

* Sampling frequency: Audio data is typically sampled at a much higher rate than other types of time series data. For example, a typical audio file is sampled at 44,100 samples per second, while other types of time series data may be sampled at much lower frequencies (e.g. hourly, daily, monthly).

* Amplitude: Audio data is typically represented as a waveform, with the amplitude of the waveform representing the volume or intensity of the sound. Amplitudes oscillate around zero and are bounded by the bit depth of the recording; for example, 16-bit audio stores integer values between -32,768 and 32,767.

* Spectral content: Audio data also has a different spectral content than other types of time series data. The spectral content of audio data refers to the distribution of energy across different frequencies, and is often represented using a spectrogram or other visualization. The spectral content of audio data can vary widely depending on the type of sound being recorded, and can be used to distinguish different types of sounds or music.
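The idea of spectral content can be illustrated with a short sketch that computes the discrete Fourier transform (DFT) of a synthetic signal and finds its dominant frequency. The two-tone signal, the 10 ms window, and the helper name `dft_magnitudes` are assumptions chosen for the example. The direct DFT is written out for clarity only; in practice one would likely use a fast Fourier transform such as `numpy.fft.rfft`.

```python
import cmath
import math

SAMPLE_RATE = 44100   # samples per second
N = 441               # 10 ms of audio, so DFT bins are 100 Hz apart

# Synthetic signal: a strong 1000 Hz tone plus a weaker 3000 Hz tone.
signal = [
    math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 3000 * n / SAMPLE_RATE)
    for n in range(N)
]


def dft_magnitudes(x):
    """Return |X[k]| for k = 0 .. len(x) // 2 using the direct DFT sum."""
    n_total = len(x)
    magnitudes = []
    for k in range(n_total // 2 + 1):
        coefficient = sum(
            x[n] * cmath.exp(-2j * math.pi * k * n / n_total)
            for n in range(n_total)
        )
        magnitudes.append(abs(coefficient))
    return magnitudes


magnitudes = dft_magnitudes(signal)
bin_width = SAMPLE_RATE / N   # frequency spacing between bins, in Hz

# The bin with the largest magnitude gives the dominant frequency.
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_frequency = peak_bin * bin_width
print(f"Dominant frequency: {peak_frequency:.0f} Hz")
```

A spectrogram repeats this computation over short, overlapping windows of a longer recording and plots the magnitudes against time.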

There are many libraries available for audio processing in Python. Some popular ones include:

1. [Librosa](https://librosa.org/doc/latest/index.html) is a library for audio and music analysis that provides a wide range of functions for tasks such as audio loading, audio feature extraction, and audio synthesis. It is built on top of [NumPy](https://numpy.org/) and [SciPy](https://scipy.org/), and is designed to be easy to use and extend.

2. [PyAudio](https://people.csail.mit.edu/hubert/pyaudio/) is a library that provides Python bindings for [PortAudio](http://www.portaudio.com/), a cross-platform audio I/O library. It is used for recording and playing back audio streams.

3. [scipy.signal](https://docs.scipy.org/doc/scipy/tutorial/signal.html) is the signal module in [SciPy](https://scipy.org/) that is designed for signal processing tasks, and provides functions for tasks such as filtering, convolution, and spectral analysis. It is built on top of [NumPy](https://numpy.org/) and is optimized for performance.

4. [pydub](https://github.com/jiaaro/pydub) is a library for working with audio in Python that provides functions for tasks such as audio file loading, audio file manipulation, and audio file output. It is built on top of [ffmpeg](https://ffmpeg.org/), and supports a wide range of audio file formats.

5. [soundfile](https://pysoundfile.readthedocs.io/en/latest/) is a library for reading and writing audio files in Python, representing the audio data as NumPy arrays. It is built on top of [libsndfile](http://www.mega-nerd.com/libsndfile/), and supports a wide range of audio file formats.


Here are some examples of analyses that can be performed on sound data:

* Speech recognition 
* Music analysis
* Sound classification 
* Audio restoration
* Audio synthesis

![A_Chromagraph](./images/chromagraph.png)

Speech recognition is the process of converting spoken language into written text, and is often used in tasks such as voice-to-text transcription, voice commands, and language translation. Music analysis is the study and interpretation of music, and can be used to find patterns in musical structure, melody, harmony, and rhythm. Sound classification identifies the type or category of a sound, and covers tasks such as audio event detection, sound scene classification, and audio tagging. Audio restoration repairs or improves the quality of audio signals, and can be used to remove noise, improve clarity, or restore damaged audio. Finally, audio synthesis generates audio signals using algorithms or models, and can be used to create artificial sounds or to synthesize music.

## Data Visualization

Data visualization is an important aspect of data analysis and interpretation because it allows us to explore and understand data in a visual format. There are different types of visualizations that we can use to represent data, each with its own strengths and limitations. Here are a few examples of common types of visualizations:

* Line plots: Line plots are used to visualize data that is measured over a continuous interval or time period. Line plots are useful for showing trends and changes in data over time.

![A_Line_Plot](./images/line-plot.png)

(Image: Average water level from Jan to Dec at Terkos Lake in Istanbul)

* Bar charts: Bar charts are used to visualize data that is divided into categories or groups. Bar charts are useful for comparing the size or frequency of different categories.

![A_Bar_Chart](./images/bar-chart.png)

* Pie charts: Pie charts are used to visualize data that is divided into categories or groups, and are useful for showing the relative sizes of different categories.

![A_Pie_Chart](./images/pie-chart.png)

* Scatter plots: Scatter plots are used to visualize the relationship between two numerical variables. Scatter plots are useful for showing trends and patterns in data.

![A_Scatter_Plot](./images/scatter-plot.png)

* Heat maps: Heat maps are used to visualize the values of a numerical variable across a grid of cells. Heat maps are useful for showing patterns and trends in data, and can be used to represent data that is structured in two dimensions.
    
![A_Heat_Map](./images/heat-map.png)

(Source: [Indifoot](https://www.indifoot.com/blog/what-is-a-heat-map-and-why-it-has-become-essential-in-the-current-sports-industry))
    
* Box plots: Box plots are used to visualize the distribution of a numerical variable. Box plots are useful for showing the range, median, and quartiles of a dataset.

![A_Box_Plot](./images/box-plot.png)

(Source: [Statistics Canada](https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214889-eng.htm))

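* Maps: Maps are used to visualize data that is tied to geographic locations or regions. Maps are useful for comparing values across regions.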

![A_Map](./images/turkish-crime.png)

(Image: Homicide rates in Turkey per 100K people.)
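The quantities that a box plot displays can be computed directly. The sketch below uses the standard library's `statistics.quantiles` to find the quartiles, and then applies the common 1.5 × IQR convention to place the whiskers and flag outliers. The data values, the function name `box_plot_stats`, and the choice of the `"inclusive"` quantile method are assumptions for illustration; plotting libraries may use slightly different quantile definitions.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Compute the quartiles, whisker ends, and outliers of a dataset."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # interquartile range
    low_fence = q1 - 1.5 * iqr         # points below this are outliers
    high_fence = q3 + 1.5 * iqr        # points above this are outliers
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical measurements with one unusually large value.
measurements = [2, 4, 4, 5, 6, 7, 8, 9, 30]
stats = box_plot_stats(measurements)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The box spans `q1` to `q3` with a line at the median, the whiskers extend to the most extreme points inside the fences, and the outliers are drawn as individual points.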

### Open Real-Data Sources

A large part of doing data science is working with data: cleaning, understanding, filtering, and transforming it. But in order to do that we need data. Unless you collect your own data, you will need to find interesting data sets that you can understand and ask questions about. Today, we are going to look at possible data sources and their uses.

An [application programming interface (API)](https://en.wikipedia.org/wiki/API) is a data connection between two pieces of software. For our purposes, it is a connection between a data consumer (you) and a data provider. Its primary function is **not** to provide data for human consumption; rather, it is for exchanging data between two computer programs. In short, you'll use an API to fetch the data, not to look at it in its raw form.
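A typical interaction with a web API has three steps: build a URL with query parameters, send the request, and parse the machine-readable response (very often JSON). The sketch below shows these steps with the standard library only. The endpoint `https://api.example.org/v1/observations`, its parameters, and the sample payload are hypothetical and do not belong to any real provider; each real API documents its own URLs and response layout.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, **params):
    """Append query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}" if params else base_url


def fetch_json(url, timeout=10):
    """Send a GET request and decode the JSON body of the response."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameters, for illustration only.
url = build_url("https://api.example.org/v1/observations", country="TR", year=2020)
print(url)

# With a real endpoint, the data would be fetched with: records = fetch_json(url)
# Here we parse a hypothetical response body instead of calling the network.
sample_payload = """
[
  {"country": "TR", "year": 2020, "value": 12.5},
  {"country": "TR", "year": 2021, "value": null}
]
"""
records = json.loads(sample_payload)

# Keep only the observations that actually carry a value.
values = {r["year"]: r["value"] for r in records if r["value"] is not None}
print(values)
```

Once the response is parsed into Python lists and dictionaries, it can be cleaned and filtered like any other dataset.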

There are many open data sources that serve data via APIs, allowing you to programmatically access and retrieve data for your data science projects. Here are a few examples of open data sources that provide APIs:

#### Government Data Providers

* [UN Data](https://data.un.org/) 
* [World Bank Open Data platform](https://data.worldbank.org/) 
* [European Union Open Data Portal](https://data.europa.eu/en) 
* [IMF Data platform](https://data.imf.org/) 
* [European Central Bank](https://sdw.ecb.europa.eu/)
* [OECD data](https://data.oecd.org/)
* [US Government (data.gov)](https://data.gov) 
* [UK Government Data](https://www.data.gov.uk/)
* [US Federal Reserve Economic Data (FRED)](https://fred.stlouisfed.org/)
* [Turkish Statistical Institute (TUIK)](https://data.tuik.gov.tr/)
* [US Census Data](https://www.census.gov/data.html)

#### Municipality Data

* [Istanbul Municipality](https://data.ibb.gov.tr/)
* [Izmir Municipality](https://acikveri.bizizmir.com/)
* [Bursa Municipality](https://acikyesil.bursa.bel.tr/dataset/)
* [Athens Open Data](http://geodata.gov.gr/en/dataset)
* [Barcelona Municipality](https://opendata-ajuntament.barcelona.cat/)
* [London Data Store](https://data.london.gov.uk/developers/)
* [New York Open Data](https://opendata.cityofnewyork.us/)
* [City of Montreal Open Data](https://donnees.montreal.ca/collections)
* [City of Toronto Open Data](https://open.toronto.ca/)

#### Health Data

* [World Health Organization](https://www.who.int/data/gho/)
* [UK National Health Service (NHS) Data](https://digital.nhs.uk/)
* [US Health Data](https://healthdata.gov/)
* [US Food and Drug Administration (FDA)](https://www.fda.gov/food)
* [US National Cancer Institute](https://seer.cancer.gov/statistics-network/explorer/application.html)
* [US Centers for Disease Control and Prevention (CDC)](https://www.cdc.gov/datastatistics/)

#### Scientific Data

* [US Geological Survey (USGS)](https://www.usgs.gov/products/data)
* [US National Aeronautics and Space Administration (NASA)](https://data.nasa.gov/)
* [European Space Agency (ESA)](https://www.esa.int/)
* [US National Oceanic and Atmospheric Administration (NOAA)](https://data.noaa.gov/datasetsearch/)
* [Open Science Data Cloud (OSDC)](https://www.opensciencedatacloud.org/)

#### Physiological Data

* [OpenNeuro](https://openneuro.org/)
* [PhysioNet](https://physionet.org/about/database/)
* [Fall Detection Dataset](https://imvia.u-bourgogne.fr/en/database/fall-detection-dataset-2.html)
* [KFall Dataset](https://sites.google.com/view/kfalldataset)
* [MM Fit Dataset](https://mmfit.github.io/)
* [A collection of body accelerometer datasets](https://mobilize.stanford.edu/data/available-datasets/)

#### Sociological Survey Data

* [General Social Survey (GSS) from NORC at the University of Chicago](https://gss.norc.org/)
* [Pew Research Center](https://www.pewresearch.org/internet/datasets/)
* [US National Survey of Families and Households](https://www.ssc.wisc.edu/nsfh/home.htm)
* [A collection of open survey data](https://hbl.gcc.libguides.com/soci377/data)

#### A List of GIS Data Sources

* [UN Geospatial Hub](https://geoservices.un.org/webapps/geohub/)
* [Natural Earth](https://www.naturalearthdata.com/)
* [OpenStreetMap](https://www.openstreetmap.org/)
* [Global Map](https://www.gsi.go.jp/kankyochiri/globalmap_e.html)
* [Libre Map Project](http://libremap.org/)
* [Open Topography Project](https://opentopography.org/)
* [Infrastructure for Spatial Information in the European Community](http://inspire.jrc.ec.europa.eu/)
* [European Environmental Agency](https://www.eionet.europa.eu/workspace/gis)
* [Geospatial Data from the Government of Canada](http://www.geogratis.gc.ca/)

#### A List of Satellite Imagery Data Sources

* [NASA Earth Data](https://worldview.earthdata.nasa.gov/)
* [ESA Earth Data](https://earth.esa.int/eogateway/catalog)
* [USGS Earth Explorer](https://earthexplorer.usgs.gov/)
* [Copernicus Open Access Hub](https://scihub.copernicus.eu/dhus/#/home)
* [NOAA Earth Data](https://coast.noaa.gov/dataviewer/#)
* [Bhuvan Indian Geo-Platform of ISRO](https://bhuvan-app3.nrsc.gov.in/data/download/index.php)
* [Maxar Open Data](https://www.maxar.com/open-data)

#### Other data providers

* [Google Public Data Explorer](https://www.google.com/publicdata/directory) and [Google Public Data API Explorer](https://developers.google.com/apis-explorer) are platforms for exploring and visualizing public data sets, and provide links for accessing data in various formats (including CSV, JSON, and KML).
* [Humanitarian Data Exchange (HDX)](https://data.humdata.org/) is an open platform for sharing data across crises and organisations.
* [data.world](https://docs.data.world/index.html?lang=en) is a third-party data provider and storage service.
* [DBpedia](https://www.dbpedia.org/) is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects.