# Data Exploration and Cleanup Process

## Initial Research

---

* [OpenSea API](https://docs.opensea.io/reference/api-overview) - The team could not obtain an API key from OpenSea, but we manually collected data on the top NFT collections from OpenSea's website
* [OpenSea TestNet](https://docs.opensea.io/reference/rinkeby-api-overview) - This API does not require a key, but it only interfaces with the Rinkeby testnet, and the NFT collections we wanted to analyze exist only on the Ethereum mainnet
* [Rarify API](https://docs.rarify.tech/reference) - The main data source for the project. We discovered that their API interfaces with OpenSea's API
* [NonFungible.com](https://nonfungible.com/market-tracker) - General NFT market data; however, the data is only provided as CSV files
* [Nansen](https://www.nansen.ai/) - The premier NFT data provider. Their API key requires a substantial paid subscription; however, the team used Nansen's application to manually compare NFT baskets to Nansen's NFT indexes
* [Twitter API](https://developer.twitter.com/en/docs/twitter-api) - Provided data access for performing sentiment analysis on various NFT collections

## Datasets Used

---

Based on subscription requirements, data restrictions, and file formats, we decided to use only Rarify's API to collect data, for the following reasons:

1. Real-time data
2. Interfaced with OpenSea
3. Free to use

## Database Construction

---

The team decided to collect data from Rarify's API and store a local copy in a Postgres database hosted on AWS.

1. Advantages of storing the data locally:

    * Data stored locally provides faster responses and ease of access
    * Team members were not limited by API call restrictions
    * The ERD documents the relationships between collections and tokens, traits, and trade transactions on the blockchain for each collection
    * All data from different sources is stored in one location, so every NFT collection can be queried the same way, by its contract address
    
### Entity Relationship Diagram (ERD)
![ERD FILE](https://github.com/jgrichardson/nft_lending/blob/final_submission/images/NFT%20Lending_ERD%20v1_4.jpeg?raw=true)


2. Rarify API provides the following data:
    * Top 100 collections based on all time volume traded
    * Trade transactions
    * Top 100 tokens for each collection
    * Traits for each collection and their rarity score
    
3. Extract, Transform, and Load (ETL) Process
    * `etl.py` runs nightly to extract data from the Rarify API and store it in the database
    * Two Python files in the database directory (`ddl.py` and `dml.py`) create the database tables and the default system data used within the application


4. SQL Queries
    * SQL queries are used to extract data for analysis
    
5. CSV files
    * Some CSV files are still used as datasets to simplify the code for this project. The team's goal is to finish converting all data sources to SQL queries against the database
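
The extract, load, and query steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `etl.py`: the sample records, the `collection` table, and its column names are all hypothetical, and an in-memory SQLite database stands in for the AWS Postgres instance so the snippet runs without credentials.

```python
import sqlite3

# Hypothetical records, standing in for a parsed API response.
# The real Rarify payload shape is not shown in this document.
SAMPLE_COLLECTIONS = [
    {"contract_address": "0xaaa", "name": "Collection A", "volume": "1200.5"},
    {"contract_address": "0xbbb", "name": "Collection B", "volume": None},
    {"contract_address": "0xaaa", "name": "Collection A", "volume": "1200.5"},
]


def transform(records):
    """Skip records without an address, de-duplicate, and cast volume to float."""
    seen = set()
    rows = []
    for rec in records:
        address = rec.get("contract_address")
        if not address or address in seen:
            continue
        seen.add(address)
        raw_volume = rec.get("volume")
        volume = float(raw_volume) if raw_volume is not None else None
        rows.append((address, rec.get("name"), volume))
    return rows


def load(conn, rows):
    """Create the (assumed) table if needed and write the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS collection ("
        "contract_address TEXT PRIMARY KEY, name TEXT, volume REAL)"
    )
    # INSERT OR REPLACE keeps a nightly re-run from failing on existing keys.
    # In Postgres the comparable statement is INSERT ... ON CONFLICT ... DO UPDATE.
    conn.executemany(
        "INSERT OR REPLACE INTO collection (contract_address, name, volume) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def top_collections(conn, limit=100):
    """Return collections ordered by traded volume, highest first."""
    cursor = conn.execute(
        "SELECT contract_address, name, volume FROM collection "
        "WHERE volume IS NOT NULL ORDER BY volume DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform(SAMPLE_COLLECTIONS))
print(top_collections(conn))
```

Keying every table on the contract address is what lets each analysis request the data the same way, regardless of which source originally supplied it.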

## Data Clean Up Process

---

1. The data provided by the Rarify API was initially cleaned and stored in eight SQL tables
2. Individual team members then filtered and cleaned the data further after loading it from the database into Pandas DataFrames
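
As an illustration of the second step, a typical DataFrame cleanup pass might look like the sketch below. The column names and sample rows are hypothetical, and the specific cleaning rules (dropping duplicates, coercing types, discarding incomplete rows) are assumptions about what such a pass usually involves rather than a record of what each team member did.

```python
import pandas as pd

# Hypothetical trade rows, as they might come back from a SQL query.
trades = pd.DataFrame(
    {
        "contract_address": ["0xaaa", "0xaaa", "0xbbb", "0xbbb"],
        "timestamp": ["2022-01-01", "2022-01-01", "2022-01-02", None],
        "price": ["1.5", "1.5", "abc", "2.0"],
    }
)


def clean_trades(df):
    """Drop duplicate rows, coerce column types, and remove incomplete rows."""
    out = df.drop_duplicates().copy()
    # Values that cannot be parsed become NaN / NaT instead of raising.
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out["timestamp"] = pd.to_datetime(out["timestamp"], errors="coerce")
    out = out.dropna(subset=["timestamp", "price"])
    return out.sort_values("timestamp").reset_index(drop=True)


cleaned = clean_trades(trades)
print(cleaned)
```

Coercing with `errors="coerce"` and then dropping the resulting missing values keeps malformed rows from stopping the analysis, at the cost of silently discarding them, so it is worth checking how many rows are removed.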