diff --git a/modules/module5/BasicEntrezQuery.ipynb b/modules/module5/BasicEntrezQuery.ipynb new file mode 100644 index 0000000..504b302 --- /dev/null +++ b/modules/module5/BasicEntrezQuery.ipynb @@ -0,0 +1,46 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from Bio import Entrez\n", + "Entrez.email = \"brian.chapman@utah.edu\" # Always tell NCBI who you are\n", + "handle = Entrez.efetch(db=\"nucleotide\", id=\"186972394\", rettype=\"gb\", retmode=\"text\")\n", + "result = handle.read()\n", + "handle.close()\n", + "print(result)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/modules/module5/PandasAndFolium.ipynb b/modules/module5/PandasAndFolium.ipynb new file mode 100644 index 0000000..bfa24b7 --- /dev/null +++ b/modules/module5/PandasAndFolium.ipynb @@ -0,0 +1,394 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Visualizing Spatial Data with Pandas and Folium" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "DATADIR = os.path.join(os.path.expanduser(\"~\"),\"DATA\",\n", + " \"Misc\")\n", + "print(os.path.exists(DATADIR))\n", + "import pandas as pd\n", + "import numpy as np" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install folium\n", + "import folium\n" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "`Accidents7904.csv` located in `~/DATA/Misc` is a a record of all the automobile accidents in the UK between 1974 and 2004. This is quite a large data set but nothing that Pandas can't handle, in principle. However, given that we don't want to over tax our system, we will limit ourselves to reading in only parts of the data.\n", + "\n", + "The original data contains 6224198 rows. However, because GPS was not declassified until the late 1990s, the early accidents do not have lattitude and longitude values are so not of interest to us. The first longitude/lattitude value occurs at row 4883216.\n", + "\n", + "We can use the [`skiprows`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html) keyword. \n", + "\n", + "`skiprows` can take\n", + "* An integer number of rows to skip\n", + "* A sequence (e.g. a list) of row numbers to skip\n", + "* Or a function that returns `True` if the row should be skipped and `False` otherwise." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Read in the data\n", + "\n", + "We'll use a `lambda` function to specify which rows to skip" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = pd.read_csv(os.path.join(DATADIR, \"Accidents7904.csv\"),\n", + " skiprows = lambda index: index >0 and index <=4883216 \n", + " \n", + " )#.dropna()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What are our columns?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What are the values in these columns?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data['Accident_Severity'].unique()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[\"Number_of_Casualties\"].unique()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[\"Light_Conditions\"].unique()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Let's limit ourselves to the following columns:\n", + "\n", + "* `Longitude`\n", + "* `Latitude`\n", + "* `Date` and `Time`\n", + "* `Number_of_Casualties`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = pd.read_csv(os.path.join(DATADIR, \"Accidents7904.csv\"),\n", + "                   usecols=['Longitude',\"Latitude\",\n", + "                            \"Date\", \"Time\",\"Number_of_Casualties\"],\n", + "                   skiprows = lambda index: index >0 and index <=4883216 )\n", + " \n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### We can drop missing values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data2 = data.dropna()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data2.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "type(data2.iloc[0,3])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "type(data2.loc[0,\"Time\"])" + ] + }, + { + "cell_type": "markdown", +
"metadata": {}, + "source": [ + "#### Date's and Times are not recognized as such and so are left as strings\n", + "\n", + "* We could set `locale`\n", + "* Or we can convert later" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data2[\"Date\"] = pd.to_datetime(data2[\"Date\"],format=\"%d/%m/%Y\", \n", + " errors='ignore')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "tmp = datetime.strptime(\"09:30\",\"%H:%M\")\n", + "print(tmp.time())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data2[\"Time\"] = data2.apply(lambda row: datetime.strptime(row[\"Time\"],\"%H:%M\").time(), \n", + " axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data2.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### We can use the [``sample``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) method to get a subset of DataFrame" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subdata = data2.sample(2000)\n", + "mean_long = np.mean(subdata['Longitude'])\n", + "mean_lat = np.mean(subdata['Latitude'])\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(folium.Map)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "map = folium.Map(location=[mean_lat, mean_long], \n", + " tiles=\"Stamen Terrain\", zoom_start=5.5)\n", + "for _, s in subdata.iterrows():\n", + " rslt = folium.Marker([s[\"Latitude\"], s[\"Longitude\"]],\n", + " popup=\"%s\\n%s\\n# Causalities: %d\"%(s[\"Date\"],\n", + " s[\"Time\"],\n", + " 
s[\"Number_of_Casualties\"]),\n", + "                         icon=folium.Icon(icon='cloud')).add_to(map)\n", + "map" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example 2\n", + "\n", + "* Filter the Pandas DataFrame on the number of casualties\n", + "* Select a different [Bootstrap icon](https://www.w3schools.com/icons/bootstrap_icons_glyphicons.asp)\n", + "* Set a different color" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from ipywidgets import interact, interactive, fixed, interact_manual, IntSlider\n", + "import ipywidgets as widgets\n", + "from IPython.display import display" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(folium.Map)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "subdata = data2.sample(100)\n", + "mean_long = np.mean(subdata['Longitude'])\n", + "mean_lat = np.mean(subdata['Latitude'])\n", + "tiles = [\"OpenStreetMap\", \"Mapbox Bright\", \"Mapbox Control Room\", \n", + "         \"Stamen Terrain\", \"Stamen Toner\", \"Stamen Watercolor\", \n", + "         \"CartoDB positron\",\"CartoDB dark_matter\"]\n", + "@interact(num_cas=IntSlider(min=1,\n", + "                            max=subdata.Number_of_Casualties.max(), \n", + "                            value=subdata.Number_of_Casualties.max()), \n", + "          data2 = fixed(subdata), \n", + "          loclat = fixed(mean_lat), \n", + "          tile=tiles,\n", + "          loclon=fixed(mean_long))\n", + "def plot_accidents(data2, num_cas, loclat, loclon, tile):\n", + "    map2 = folium.Map(location=[loclat, loclon], \n", + "                      tiles=tile, zoom_start=5.5)\n", + "    for _, s in data2[data2[\"Number_of_Casualties\"]>=num_cas].iterrows():\n", + "        rslt = folium.Marker([s[\"Latitude\"], s[\"Longitude\"]],\n", + "                             popup=\"%s\\n%s\\n# Casualties: %d\"%(s[\"Date\"],\n", + "                                    s[\"Time\"],\n", + "                                    s[\"Number_of_Casualties\"]),\n", + "                             icon=folium.Icon(icon=\"fa-ambulance\", color='red',
prefix=\"fa\"),\n", + "                             tooltip = 'Click for accident details').add_to(map2)\n", + "    display(map2)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/modules/module5/Resources/case005.png b/modules/module5/Resources/case005.png new file mode 100644 index 0000000..7b31fee Binary files /dev/null and b/modules/module5/Resources/case005.png differ diff --git a/modules/module5/Resources/disease_graphs.png b/modules/module5/Resources/disease_graphs.png new file mode 100644 index 0000000..2f0335d Binary files /dev/null and b/modules/module5/Resources/disease_graphs.png differ diff --git a/modules/module5/Resources/mainMail0075.png b/modules/module5/Resources/mainMail0075.png new file mode 100644 index 0000000..33daf43 Binary files /dev/null and b/modules/module5/Resources/mainMail0075.png differ diff --git a/modules/module5/authorship_networks.ipynb b/modules/module5/authorship_networks.ipynb new file mode 100644 index 0000000..6085055 --- /dev/null +++ b/modules/module5/authorship_networks.ipynb @@ -0,0 +1,865 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Graph Relationships Among University of Utah Faculty\n", + "### © Brian E. Chapman, Ph.D.\n", + "\n", + "In this notebook, we will create graphs that show co-authorship relationships between faculty from either Biomedical Informatics or Human Genetics at the University of Utah.
Co-authorship is determined by querying the PubMed database using the [Biopython](https://github.com/biopython/biopython.github.io/) [Entrez](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html) subpackage. Graphs are created using [NetworkX](https://networkx.github.io/).\n", + "\n", + "\n", + "### Key Concepts\n", + "#### These concepts and applications might be new to the student.\n", + "\n", + "* Graphs\n", + "* Gzip\n", + "* Pickle\n", + "* [set](https://docs.python.org/3.5/tutorial/datastructures.html#sets)\n", + "* Graphviz" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Querying the Entrez Database\n", + "\n", + "In this notebook we will be querying papers in the PubMed database. For an example of querying the nucleotide database for a sequence, click [here](./BasicEntrezQuery.ipynb)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The Notebook Allows the User to Either Create New PubMed Data or Use Cached Results\n", + "\n", + "#### Set ``GET_NEW_DATA`` to True to query Entrez" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "GET_NEW_DATA = True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from Bio import Entrez\n", + "import networkx as nx\n", + "import os\n", + "DATADIR = os.path.join(os.path.expanduser(\"~\"),\"DATA\", \"Graphs\")\n", + "if GET_NEW_DATA:\n", + "    DATADIR = os.path.join(os.path.expanduser(\"~\"), \"work\", \"graphs\")\n", + "    if not os.path.exists(DATADIR):\n", + "        os.makedirs(DATADIR)\n", + "    \n", + "print(os.path.exists(DATADIR))\n", + "from IPython.display import Image\n", + "import gzip\n", + "import pickle\n", + "import numpy as np\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, +
"source": [ + "### Faculty Names\n", + "\n", + "* Human Genetic faculty names were scrapped from the department website\n", + "* Biomedical Informatics faculty names were taken from training grant materials\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hg_faculty = ['Mario Capecchi',\n", + " 'Richard Cawthon',\n", + " 'Clement Chow',\n", + " 'Nels Elde',\n", + " 'Cedric Feschotte',\n", + " 'David J. Grunwald',\n", + " 'Sandra J. Hasstedt',\n", + " 'Michael T. Howard',\n", + " 'Lynn B. Jorde',\n", + " 'Gabrielle Kardon',\n", + " 'Kristen M. Kwan',\n", + " 'Mark F. Leppert',\n", + " 'Anthea Letsou',\n", + " 'Suzanne L. Mansour',\n", + " 'Gabor Marth',\n", + " 'Mark M. Metzstein',\n", + " 'Charles Murtaugh',\n", + " 'Ellen J. Pritham',\n", + " 'Aaron Quinlan',\n", + " 'Shigeru Sakonju',\n", + " 'Gillian Stanfield',\n", + " 'Louisa Stark',\n", + " 'Carl S. Thummel',\n", + " 'Robert B. Weiss',\n", + " 'Mark Yandell']\n", + "bmi_faculty = ['Samir E AbdelRahman',\n", + " 'Bruce E Bray',\n", + " 'Wendy W Chapman',\n", + " 'Michael A Conway',\n", + " 'Guilherme Del Fiol',\n", + " 'Karen Eilbeck',\n", + " 'R. 
Scott Evans',\n", + "               'Julio C Facelli',\n", + "               'Bryan S Gibson',\n", + "               'Ramkiran Gouripeddi',\n", + "               'Peter J Haug',\n", + "               'Rachel Hess',\n", + "               'Stanley M Huff',\n", + "               'John F Hurdle',\n", + "               'Jonathan Nebeker',\n", + "               'Kensaku Kawamoto',\n", + "               'Younghee Lee',\n", + "               'Scott P Narus',\n", + "               'Aaron Quinlan',\n", + "               'Charlene R Weir']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Functions for Querying Entrez" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def join_first_last(name):\n", + "    return \" \".join(split_name(name))\n", + "def split_name(name):\n", + "    tmp = name.split()\n", + "    return tmp[0],tmp[-1]\n", + "def check_author_faculty(author, faculty):\n", + "    a = split_name(author)\n", + "    \n", + "    for f in faculty:\n", + "        if a[0] in f and a[1] in f:\n", + "            return True\n", + "    return False" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Get the PubMed IDs matching a query" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def search(query, email = None):\n", + "    if email is None:\n", + "        raise ValueError(\"Must provide a valid e-mail\")\n", + "    Entrez.email = email\n", + "    handle = Entrez.esearch(db='pubmed', \n", + "                            sort='relevance', \n", + "                            retmax='100',\n", + "                            retmode='xml', \n", + "                            term=query)\n", + "    results = Entrez.read(handle)\n", + "    return results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Fetch papers corresponding to IDs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def fetch_details(id_list, email=None, retmode='xml'):\n", + "    if email is None:\n", + "        raise ValueError(\"Must provide a valid e-mail\")\n", + "    ids = ','.join(id_list)\n", + "    Entrez.email = email\n", + "    handle = Entrez.epost(db='pubmed',\n", + "                          id=ids)\n", + "    
results = Entrez.read(handle)\n", + "    webenv = results[\"WebEnv\"]\n", + "    query_key = results[\"QueryKey\"]\n", + "    fetch_handle = Entrez.efetch(db='pubmed', \n", + "                                 rettype='xml',\n", + "                                 retmode=retmode,\n", + "                                 webenv=webenv, \n", + "                                 query_key=query_key)\n", + "    records = Entrez.read(fetch_handle, validate=False)  # Entrez.read parses XML\n", + "    return list(records[\"PubmedArticle\"])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Get Co-authorship\n", + "\n", + "Entrez returns a lot of information. We pare it down to just the author names. We need to use exceptions because the returned papers don't always have the fields we want." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# help(Entrez.efetch)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def get_coauthor_lists(papers):\n", + "    paper_authors = {}\n", + "    for p in papers:\n", + "        try:\n", + "            tmp = p['MedlineCitation']\n", + "            alist = []\n", + "            for a in tmp['Article']['AuthorList']:\n", + "                try:\n", + "                    s = \"%s %s\"%(a['ForeName'],a['LastName'])\n", + "                    alist.append(s)\n", + "                except Exception:  # skip authors without a fore name or last name\n", + "                    pass\n", + "            paper_authors[tmp['Article']['ArticleTitle']] = alist\n", + "        except Exception:  # skip papers that lack the expected fields\n", + "            pass\n", + "    return paper_authors" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "def get_faculty_coauthors(faculty, email=None):\n", + "    return get_coauthor_lists( \n", + "        fetch_details(search(faculty, email=email)['IdList'], email=email, retmode='xml'))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def get_faculty_coauthorship(faculty, email=None):\n", + "    if email is None:\n", + "        raise ValueError(\"Must provide a valid e-mail\")\n", + "    faculty_ids = {f:search(f, email=email)['IdList'] for f in faculty}\n", + "    faculty_details_text = {f:fetch_details(ids,
retmode='xml', email=email) for f, ids in faculty_ids.items() if ids}\n", + "    coauthors = {f:get_coauthor_lists(ad) for f, ad in faculty_details_text.items()}\n", + "    return coauthors" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if GET_NEW_DATA:\n", + "    hg_coauthorship = get_faculty_coauthorship(hg_faculty, email=\"brian.chapman@utah.edu\")\n", + "    bmi_coauthorship = get_faculty_coauthorship(bmi_faculty, email=\"brian.chapman@utah.edu\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if GET_NEW_DATA:\n", + "    with gzip.open(os.path.join(DATADIR, \"hg_coauthorship.pickle.gz\"), \"wb\") as f0:\n", + "        pickle.dump(hg_coauthorship, f0)\n", + "\n", + "    with gzip.open(os.path.join(DATADIR, \"bmi_coauthorship.pickle.gz\"), \"wb\") as f0:\n", + "        pickle.dump(bmi_coauthorship, f0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hg_coauthorship" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Local Data\n", + "\n", + "The connection with the Entrez database can be problematic. If you are having difficulty, try loading the data I have previously generated. The files are stored in compressed ([gzip](https://en.wikipedia.org/wiki/Gzip)) [pickle](https://docs.python.org/3/library/pickle.html) files. Gzip is a common compression format, while pickle is a Python-specific format for writing Python objects to disk. Gzipped files can be opened directly by Python and then treated like a normal file. See the documentation for the Python [gzip library](https://docs.python.org/3/library/gzip.html).
When I generated the files I used the [``pickle.dump``](https://docs.python.org/3/library/pickle.html#pickle.dump) function to store ``hg_coauthorship`` and ``bmi_coauthorship`` in individual files.\n", + "\n", + "Read the documentation to figure out how to **load** the data back into a Python program." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if not GET_NEW_DATA:\n", + "    with gzip.open(os.path.join(DATADIR, \"hg_coauthorship.pickle.gz\"), \"rb\") as f0:\n", + "        hg_coauthorship = pickle.load(f0)\n", + "    with gzip.open(os.path.join(DATADIR, \"bmi_coauthorship.pickle.gz\"), \"rb\") as f0:\n", + "        bmi_coauthorship = pickle.load(f0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Create Undirected Graphs of Co-authorships\n", + "### We will limit nodes of the graphs to faculty of the department\n", + "\n", + "* If we do not add a co-author to the graph, we will add them to a list\n", + "* We will start by creating graphs where we check the author names returned by PubMed against the list of faculty names we defined at the top of the notebook.\n", + "    * We can create either a NetworkX [MultiGraph](https://networkx.github.io/documentation/development/reference/classes.multigraph.html?highlight=multigraph) or a [Graph](https://networkx.github.io/documentation/development/reference/classes.graph.html).
A MultiGraph allows for multiple edges (relationships) between nodes (authors), while a Graph allows for only a single edge between two nodes.\n", + "    * What might be the potential uses of both styles of graphs in an analysis of co-authorships?\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def create_authorship_graph_naive(coauthors, faculty, graph_type = \"multi\"):\n", + "    \"\"\"\n", + "    Build a co-authorship graph, adding an edge only when a co-author's name exactly matches a faculty name.\n", + "    \"\"\"\n", + "    if graph_type == \"multi\":\n", + "        authorship = nx.MultiGraph()\n", + "    else:\n", + "        authorship = nx.Graph()\n", + "    \n", + "    not_added = []\n", + "    faculty_tuples = [split_name(x) for x in faculty]\n", + "    for author, papers in coauthors.items():\n", + "        for title, authors in papers.items():\n", + "            for a in authors:\n", + "                if a != author:\n", + "                    if a in faculty:\n", + "                        authorship.add_edge(author, \n", + "                                            a, key=title, attr_dict={\"paper\":title})\n", + "                    else:\n", + "                        not_added.append(a)\n", + "    return authorship, not_added" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Notes\n", + "\n", + "* ``_``: an underscore is commonly used for a throwaway variable" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mgraphn_hg, not_added_n_hg = create_authorship_graph_naive(hg_coauthorship, hg_faculty)\n", + "graphn_hg, _ = create_authorship_graph_naive(hg_coauthorship, hg_faculty, graph_type = \"graph\")\n", + "\n", + "mgraphn_bmi, not_added_n_bmi = create_authorship_graph_naive(bmi_coauthorship, bmi_faculty)\n", + "graphn_bmi, _ = create_authorship_graph_naive(bmi_coauthorship, bmi_faculty, graph_type = \"graph\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Explore the graphs\n", + "\n", + "NetworkX has functions for basic descriptions of graphs, such as the number of nodes and number of edges.
There are also functions for characterizing the nodes of a graph, such as the degree (the number of edges connected to a node). We can draw the graphs with Matplotlib.\n", + "\n", + "#### Explore the following drawing functions\n", + "* ``nx.draw``\n", + "* ``nx.draw_spring``\n", + "* ``nx.draw_spectral``\n", + "* ``nx.draw_circular``" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(len(graphn_hg))\n", + "print(graphn_hg.number_of_nodes(), graphn_hg.number_of_edges())\n", + "nx.draw(graphn_hg, with_labels=True, alpha=0.4, font_size=8)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(len(graphn_bmi))\n", + "print(graphn_bmi.number_of_nodes(), graphn_bmi.number_of_edges())\n", + "nx.draw_circular(graphn_bmi, with_labels=True, alpha=0.4, font_size=8)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Draw the Graphs with Graphviz\n", + "\n", + "While NetworkX comes with a built-in Matplotlib drawing interface, it takes a lot of customization to make worthwhile figures. Instead, we will use NetworkX's interface to [graphviz](http://www.graphviz.org/) to generate figures. We will have to save the figures to disk and then use the IPython notebook [``display.Image``](http://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html?highlight=display.image#functions) function to draw the figures in the notebook."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def save_graph(g, name):\n", + "    \"\"\"Render graph ``g`` to ``name``.png with Graphviz (via pydot)\n", + "    and return the file name.\"\"\"\n", + "    ag = nx.nx_pydot.to_pydot(g)\n", + "    fname = name+\".png\"\n", + "    ag.write_png(fname)\n", + "    return fname" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Image(save_graph(mgraphn_hg,\"mgraphn_hg\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Image(save_graph(graphn_hg,\"graphn_hg\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Image(save_graph(mgraphn_bmi,\"mgraphn_bmi\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Image(save_graph(graphn_bmi,\"graphn_bmi\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Examine Our Data\n", + "\n", + "The Human Genetics graph seems suspiciously small. We should look at whom we didn't add.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Number of authors not added to the graph:\", len(not_added_n_hg))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The same author may appear in the not-added list multiple times (for example, a non-faculty author on multiple papers). We can use a Python [set](https://docs.python.org/3.5/tutorial/datastructures.html#sets) to get the unique set of authors not added.
For more information about sets click [here](./sets_and_python.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "len(set(not_added_n_hg))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The set of unique authors not added is about half the size of the not added list. This list would consist of students, collaborators from other departments and universities, etc. But is there any chance we didn't add someone we should have?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise\n", + "\n", + "How would you check whether Mark Yandell was ever incorrectly **not** added?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating a Better Graph Generation Function\n", + "\n", + "One of the challenges we have with our graph generation is that we do not consistently use names. Sometimes I'm \"Brian Chapman\" other times I'm \"Brian E. Chapman\". The same could be true of other authors. If the web page from which I got the faculty names uses names different than how the faculty names appear in PubMed, we would have problems. \n", + "\n", + "The most likely problem is with middle names so we could write a function that extracts the first and last name from a name string and we could write another function that checks first names and last names against the list of faculty names (using only the first and last names from that list)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using our first/last name checking algorithm, we can create a better graph generation function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def create_authorship_graph(coauthors, faculty, graph_type = \"multi\"):\n", + "    \"\"\"\n", + "    Build a co-authorship graph, matching co-authors to faculty on first and last name only.\n", + "    \"\"\"\n", + "    if graph_type == \"multi\":\n", + "        authorship = nx.MultiGraph()\n", + "    else:\n", + "        authorship = nx.Graph()\n", + "    \n", + "    not_added = []\n", + "    faculty_tuples = [split_name(x) for x in faculty]\n", + "    for author, papers in coauthors.items():\n", + "        for title, authors in papers.items():\n", + "            for a in authors:\n", + "                if join_first_last(a) != join_first_last(author):\n", + "                    if check_author_faculty(a, faculty_tuples):\n", + "                        authorship.add_edge(join_first_last(author), \n", + "                                            join_first_last(a), key=title, attr_dict={\"paper\":title})\n", + "                    else:\n", + "                        not_added.append(a)\n", + "    return authorship, not_added" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mgraph_hg, not_added_hg = create_authorship_graph(hg_coauthorship, hg_faculty)\n", + "graph_hg, _ = create_authorship_graph(hg_coauthorship, hg_faculty, graph_type = \"graph\")\n", + "\n", + "mgraph_bmi, not_added_bmi = create_authorship_graph(bmi_coauthorship, bmi_faculty)\n", + "graph_bmi, _ = create_authorship_graph(bmi_coauthorship, bmi_faculty, graph_type = \"graph\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Examine Our Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Number of authors not added to the graph:\", len(set(not_added_hg)))\n", + "print(mgraph_hg.number_of_nodes(), mgraph_hg.number_of_edges()\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs":
[], + "source": [ + "print(\"Number of authors not added to the graph:\", len(set(not_added_bmi)))\n", + "print(mgraph_bmi.number_of_nodes(), mgraph_bmi.number_of_edges())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Image(save_graph(mgraph_hg,\"mgraph_hg\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Image(save_graph(graph_hg,\"graph_hg\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Image(save_graph(mgraph_bmi,\"mgraph_bmi\"))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Image(save_graph(graph_bmi,\"graph_bmi\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise \n", + "\n", + "* Jonathan Nebeker was incorrectly added as primary faculty in Biomedical Informatics. How can you delete Dr. Nebeker from the graph along with all edges connecting to him?"
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise: How Do the Graphs Compare?\n", + "\n", + "* What is the average clustering of each graph?\n", + "* What is the diameter of each graph?\n", + "* What is the average shortest path length of each graph?\n", + "\n", + "**Hint:** Functions for answering each of these questions are provided by NetworkX and can be found in the [documentation](http://networkx.readthedocs.io/en/stable/index.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def get_graph_degrees(g):\n", + "    gd = [(n,g.degree(n)) for n in g.nodes()]\n", + "    gd.sort(key = lambda x: x[1], reverse = True)\n", + "    return gd" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def print_graph_node_degree_value(g):\n", + "    for x in get_graph_degrees(g):\n", + "        print(\"%s%s\"%((\"%s\"%x[0]).ljust(30),(\"% 2d\"%x[1]).rjust(5)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_graph_node_degree_value(mgraph_bmi)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_graph_node_degree_value(mgraph_hg)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + 
"print_graph_node_degree_value(graph_hg)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_graph_node_degree_value(graph_bmi)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Creative
University of Utah Data Science for Health by Brian E. Chapman is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/modules/module5/intro_to_graphs.ipynb b/modules/module5/intro_to_graphs.ipynb new file mode 100644 index 0000000..9d8b0df --- /dev/null +++ b/modules/module5/intro_to_graphs.ipynb @@ -0,0 +1,236 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "internals": { + "slide_helper": "subslide_end", + "slide_type": "subslide" + }, + "slide_helper": "slide_end", + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Using Graphs to Model Linked Data\n", + "#### © Brian E. 
Chapman, PhD\n", + "\n", + "In this module we will learn how to model graph data with [NetworkX](http://networkx.readthedocs.io/en/latest/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "internals": { + "slide_type": "subslide" + }, + "slideshow": { + "slide_type": "slide" + } + }, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "internals": { + "slide_helper": "subslide_end" + }, + "slide_helper": "slide_end", + "slideshow": { + "slide_type": "skip" + } + }, + "outputs": [], + "source": [ + "import os\n", + "DATADIR = os.path.join(os.path.expanduser('~'), \"DATA\")\n", + "\n", + "import networkx as nx\n", + "import csv\n", + "import imaplib\n", + "import getpass\n", + "import email\n", + "from collections import defaultdict\n", + "from IPython.display import Image\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "internals": { + "slide_helper": "subslide_end", + "slide_type": "subslide" + }, + "slide_helper": "subslide_end", + "slideshow": { + "slide_type": "slide" + } + }, + "source": [ + "# Graphs\n", + "\n", + "\n", + "* Graphs are a data representation consisting of **nodes** and **edges**\n", + "* Nodes are entities\n", + "* Edges are relationships\n", + "* Examples\n", + "    * Text:\n", + "        * Nodes are words in a sentence (e.g. findings, modifiers, conjunctions)\n", + "        * Edges are relationships between the words\n", + "    * Images:\n", + "        * Nodes are anatomic features (e.g. bifurcations)\n", + "        * Edges are adjacency/paths between features (e.g. vessels)\n", + "    * Social Networks\n", + "        * Nodes are people\n", + "        * Edges are relationships (e.g. 
friendship, coauthorship)\n", + "    * Physiology\n", + "        * Brain connectivity\n", + "        * Metabolic pathways\n", + "    * Ontologies\n", + " \n", + "## Example Graphs\n", + "### Word Relationships\n", + "![word relationships](./Resources/case005.png)\n", + "\n", + "### An *undirected* graph based on e-mails\n", + "![email graph](./Resources/mainMail0075.png)\n", + "\n", + "### A *directed* graph from the human disease ontology\n", + "![example disease graph](./Resources/disease_graphs.png)\n", + " \n", + "## Python Graph Packages\n", + "\n", + "* [NetworkX:](http://networkx.github.io/) this is a very popular, easy to use package. Its main advantage, and also its main disadvantage, is that it is pure Python. Consequently, it is easy to use but relatively slow.\n", + "* [graph-tool:](https://graph-tool.skewed.de/) \"Despite its nice, soft outer appearance of a regular python module, the core algorithms and data structures of graph-tool are written in C++, with performance in mind. Most of the time, you can expect the algorithms to run just as fast as if graph-tool were a pure C/C++ library.\"\n", + "* [python-igraph:](http://igraph.org/python/) \"igraph is a collection of network analysis tools with the emphasis on efficiency, portability and ease of use. igraph is open source and free. 
igraph can be programmed in R, Python and C/C++.\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "internals": { + "slide_helper": "subslide_end", + "slide_type": "subslide" + }, + "slide_helper": "subslide_end", + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "# [NetworkX](http://networkx.github.io/)\n", + "* Graphs (networkx.Graph())\n", + " * Edges (relationships) have no directionality\n", + "* Directional Graphs (networkx.DiGraph())\n", + " * Edges (relationships) have directionality\n", + "* MultiGraphs (networkx.MultiGraph(), networkx.MultiDiGraph() )\n", + " * There can be multiple edges between nodes \n", + "* Graphs, nodes, and edges can all have attributes (dictionaries)\n", + " * Each node has a label\n", + " * Each node also has a dictionary (possibly empty) of attributes\n", + " * Each edge also has a label (the node labels defining the beginning and ending of the edge) \n", + " * Each edge also has a dictionary (possibly empty) of attributes\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating graphs is a matter of adding nodes and edges\n", + "\n", + "* If we add an edge it will add a node, if needed." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import networkx as nx\n", + "\n", + "informatics = nx.DiGraph()\n", + "informatics.add_node(\"Homer Warner\")\n", + "informatics.add_node(\"Paul Clayton\")\n", + "informatics.add_edge(\"Homer Warner\",\"Reed Gardner\")\n", + "informatics.add_edge(\"Homer Warner\", \"Al Pryor\")\n", + "informatics.add_edge(\"Al Pryor\", \"Dennis Parker\")\n", + "informatics.add_edge(\"Dennis Parker\", \"Brian Chapman\")\n", + "informatics.add_edge(\"Brian Chapman\", \"Holly Perry\")\n", + "informatics.add_edge(\"Peter Haug\",\"Wendy Chapman\")\n", + "informatics.add_edge(\"Wendy Chapman\", \"Jeannie Irwin\")\n", + "nx.draw_spring(informatics, with_labels=True, alpha=0.3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "internals": { + "slide_helper": "subslide_end", + "slide_type": "subslide" + }, + "slide_helper": "subslide_end", + "slideshow": { + "slide_type": "subslide" + } + }, + "source": [ + "## Some NetworkX Notebooks\n", + "* [From *Learning IPython for Interactive Computing and Data Visualization*](http://nbviewer.ipython.org/github/ipython-books/minibook-code/blob/master/chapter2/203-networkx.ipynb)\n", + "* [Twitter Data](http://nbviewer.ipython.org/gist/ellisonbg/3837783/TwitterNetworkX.ipynb)\n", + "* [NetworkX Basics](https://www.wakari.io/sharing/bundle/nvikram/Basics%20of%20Networkx?has_login=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Further Reading\n", + "[Here is a brief course on graphs and Python](http://www.python-course.eu/graphs_python.php)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Creative
University of Utah Data Science for Health by Brian E. Chapman is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/modules/python_time/bp_analysis_1.ipynb b/modules/python_time/bp_analysis_1.ipynb new file mode 100644 index 0000000..d5f6bc6 --- /dev/null +++ b/modules/python_time/bp_analysis_1.ipynb @@ -0,0 +1,361 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Analyzing OR Blood Pressure Measurements" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import os\n", + "import time\n", + "import datetime\n", + "import numpy as np\n", + "DATADIR = os.path.join(os.path.expanduser('~'),\"DATA\", \"TimeSeries\", \"UofUData\")\n", + "os.path.exists(DATADIR)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = pd.read_csv(os.path.join(DATADIR,\"data_all.csv\"), nrows=1000)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "type(data[\"noninvDIA\"][0])\n", + 
"np.nan" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "t0 = data[\"VirtualDateTime\"][0]\n", + "print(t0)\n", + "print(type(t0))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Convert ``VirtualDataTime`` from string to datetime\n", + "\n", + "### Define parse string" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "parse_str = \"%Y-%m-%d %H:%M:%S\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Test parse string" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(time.strftime(parse_str, time.localtime()))\n", + "time.strptime(t0, parse_str)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create datetime" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(datetime.datetime.strptime(t0, parse_str))\n", + "print(type(datetime.datetime.strptime(t0, parse_str)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def datestring_to_datetime(s, parse_str):\n", + " try:\n", + " return datetime.datetime.strptime(s, parse_str)\n", + " except:\n", + " return np.nan" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "type(datestring_to_datetime(t0, parse_str))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Modify DataFrame" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[\"VirtualDateTime Parsed\"] = \\\n", + "data.apply(lambda x: datetime.datetime.strptime(x[\"VirtualDateTime\"], parse_str), axis=1)\n", + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "metadata": {}, + "outputs": [], + "source": [ + "data[\"VirtualDateTime Parsed\"][0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Alternatively, we can do simple date conversions using the Pandas ``to_datetime`` function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[\"VirtualDateTime Parsed2\"] = pd.to_datetime(data[\"VirtualDateTime\"], dayfirst=True)\n", + "print(type(data[\"VirtualDateTime Parsed2\"][0]))\n", + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[\"VirtualCaseID\"].unique()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sumbdata = data.dropna().head()#[\"invDIA\"].plot()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[data[\"VirtualCaseID\"]==10349].plot(x=\"VirtualDateTime Parsed\", \n", + " y=[\"invSYS\", \"invMAP\", \"invDIA\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Do we need to drop data?\n", + "#### Explore ``dropna`` with different values for ``how``" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[data[\"VirtualCaseID\"]==10349].dropna(how=\"all\").plot(x=\"VirtualDateTime Parsed\", \n", + " y=[\"invSYS\", \"invMAP\", \"invDIA\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Cleansing Filters\n", + "#### Consider the following two criteria for considering a measurement as spurious\n", + "\n", + "1. x changes by more than 100 from one sample to the next\n", + "    * $|x_{i}-x_{i-1}| > 100$\n", + "1. x is lower than 10\n", + "    * $x_i < 10$\n", + " \n", + "### Analysis\n", + "\n", + "* The second condition should be easy for us to implement. 
We've already worked through multiple examples of Boolean filtering.\n", + "* The first condition is more challenging because it requires taking differences between rows, and to date we've only computed on single rows.\n", + "\n", + "### Approaches to Computing Differences\n", + "\n", + "1. We could use the Pandas DataFrame [``shift``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html) method\n", + "1. We could use the Pandas DataFrame [``diff``](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.diff.html) method" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Shift Approach" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[\"invSYS\"] - data[\"invSYS\"].shift(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise: \n", + "### Use the shift method to implement the maximum difference filter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Diff Method" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data[\"invSYS\"].diff(-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise: \n", + "### Use the diff method to implement the maximum difference filter" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Built-in Pandas Computational Tools\n", + "### Pandas provides a number of functions for smoothing data that might be of value\n", + "#### [Window Functions](http://pandas.pydata.org/pandas-docs/stable/computation.html)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + 
"outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/modules/python_time/datetime_in_python.ipynb b/modules/python_time/datetime_in_python.ipynb new file mode 100644 index 0000000..221f692 --- /dev/null +++ b/modules/python_time/datetime_in_python.ipynb @@ -0,0 +1,196 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Python's [``datetime``](https://docs.python.org/3.5/library/datetime.html#module-datetime) package\n", + "\n", + "### Python documentation description of ``datetime``\n", + "\n", + ">The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation. For related functionality, see also the time and calendar modules.\n", + ">\n", + "There are two kinds of date and time objects: “naive” and “aware”.\n", + ">\n", + "An aware object has sufficient knowledge of applicable algorithmic and political time adjustments, such as time zone and daylight saving time information, to locate itself relative to other aware objects. An aware object is used to represent a specific moment in time that is not open to interpretation.\n", + ">\n", + "A naive object does not contain enough information to unambiguously locate itself relative to other date/time objects. 
Whether a naive object represents Coordinated Universal Time (UTC), local time, or time in some other timezone is purely up to the program, just like it is up to the program whether a particular number represents metres, miles, or mass. Naive objects are easy to understand and to work with, at the cost of ignoring some aspects of reality. ([Python documentation](https://docs.python.org/3.5/library/datetime.html#module-datetime))\n", + "\n", + "``datetime`` defines several classes, including:\n", + "\n", + "* [``date``](https://docs.python.org/3.5/library/datetime.html#datetime.date)\n", + "    * A representation of dates with year, month, day\n", + "* [``time``](https://docs.python.org/3.5/library/datetime.html#time-objects)\n", + "    * A class for representing a time of day, independent of any particular date (this class is separate from the ``time`` module)\n", + "* [``datetime``](https://docs.python.org/3.5/library/datetime.html#datetime.datetime)\n", + "    * A combination of the date class and the time class\n", + "* [``timedelta``](https://docs.python.org/3.5/library/datetime.html#datetime.timedelta)\n", + "    * A duration: the difference between two dates, times, or datetimes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import datetime\n", + "import time" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating datetime objects" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### We can create datetime objects using the class constructor\n", + "\n", + "If you do ``help(datetime.datetime)`` you will find that year, month, and day are **positional arguments**, and there are a variety of **keyword arguments** for hours, minutes, etc." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "start_time = datetime.datetime(1994, 9, 26, hour=7, minute=45)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### There are [``today``](https://docs.python.org/3.5/library/datetime.html#datetime.datetime.today) and [``now``](https://docs.python.org/3.5/library/datetime.html#datetime.datetime.now) class methods for determining the current time without and with time zone support" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "now = datetime.datetime.now()\n", + "print(now)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### If I have a timestamp (e.g. from ``time.time()``) I can create a datetime object from it" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mytime = time.time()\n", + "mydatetime = datetime.datetime.fromtimestamp(mytime)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(datetime.time)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [``timedelta``](https://docs.python.org/3.5/library/datetime.html#datetime.timedelta)\n", + "\n", + "datetime and time instances are valuable largely because we can reason with them. For example, I can do comparisons and arithmetic between two dates." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Now greater than start_time:\",now > start_time)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Now less than start_time:\", now < start_time)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "delta = now - start_time\n", + "print(type(delta))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using our ``timedelta`` object we can compute the elapsed number of days (or seconds)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"number of elapsed days:\", delta.days)\n", + "print(\"number of elapsed seconds:\", delta.total_seconds())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/modules/python_time/mimic2_timeseries_admissions.ipynb b/modules/python_time/mimic2_timeseries_admissions.ipynb new file mode 100644 index 0000000..5405e15 --- /dev/null +++ b/modules/python_time/mimic2_timeseries_admissions.ipynb @@ -0,0 +1,106 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Identifying Patient Cohorts in [MIMIC-II](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3124312/)\n" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pymysql\n", + "import pandas as pd\n", + "import getpass\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import datetime\n", + "import time\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "conn = pymysql.connect(host=\"mysql\",\n", + " port=3306,user=\"jovyan\",\n", + " passwd=getpass.getpass(\"Enter MySQL passwd for jovyan\"),db='mimic2')\n", + "cursor = conn.cursor()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pd.read_sql(\"\"\"SELECT * FROM admissions LIMIT 50\"\"\", conn).head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise: Create a Histogram of the length of stay for subjects in the database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise: Create a histogram of the day of the week when patients are admitted" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/modules/python_time/mimic2_timeseries_heartrate.ipynb 
b/modules/python_time/mimic2_timeseries_heartrate.ipynb new file mode 100644 index 0000000..3f3f3cf --- /dev/null +++ b/modules/python_time/mimic2_timeseries_heartrate.ipynb @@ -0,0 +1,171 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Identifying Patient Cohorts in [MIMIC-II](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3124312/)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pymysql\n", + "import pandas as pd\n", + "import getpass\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import datetime\n", + "import time\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "conn = pymysql.connect(host=\"mysql\",\n", + " port=3306,user=\"jovyan\",\n", + " passwd=getpass.getpass(\"Enter MySQL passwd for jovyan\"),db='mimic2')\n", + "cursor = conn.cursor()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example Query: Heart Rate and Blood Pressure\n", + "\n", + "#### Select a patient from the following ids\n", + "* 12613\n", + "* 11923\n", + "* 517\n", + "* 14898" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hr = pd.read_sql(\"\"\"SELECT subject_id, \n", + " icustay_id, \n", + " charttime, \n", + " realtime,\n", + " value1num,\n", + " value1uom\n", + " FROM chartevents\n", + " WHERE itemid in (211) AND\n", + " subject_id in (11923)\"\"\"\n", + " ,conn)\n", + "hr.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bp = 
pd.read_sql(\"\"\"SELECT subject_id, \n", + " icustay_id, \n", + " charttime, \n", + " realtime,\n", + " value1num,\n", + " value1uom,\n", + " value2num,\n", + " value2uom\n", + " FROM chartevents\n", + " WHERE itemid in (6, 51, 455, 6701) AND\n", + " subject_id in (11923)\"\"\"\n", + " ,conn)\n", + "bp.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(bp[\"icustay_id\"].value_counts())\n", + "print(bp[\"subject_id\"].value_counts())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hr[\"icustay_id\"].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Plot Heart Rate as a Time Series" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hr.plot(x=\"realtime\", y=\"value1num\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/modules/python_time/time_in_python.ipynb b/modules/python_time/time_in_python.ipynb new file mode 100644 index 0000000..dae6a40 --- /dev/null +++ b/modules/python_time/time_in_python.ipynb @@ -0,0 +1,412 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Time Data in Python\n", + "#### © Brian E. 
Chapman, Ph.D.\n", + "Within the Python standard library there are three primary modules related to time:\n", + "\n", + "* [``time``](https://docs.python.org/3.5/library/time.html)\n", + "* [``datetime``](https://docs.python.org/3.5/library/datetime.html#module-datetime)\n", + "* [``calendar``](https://docs.python.org/3.5/library/calendar.html#module-calendar)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## [``time``](https://docs.python.org/3.5/library/time.html)\n", + "\n", + "Let's start with the simplest function in the ``time`` module: ``time``.\n", + "\n", + "``time.time`` returns the number of elapsed seconds since the \"epoch.\" \n", + "\n", + ">The epoch is the point where the time starts. On January 1st of that year, at 0 hours, the “time since the epoch” is zero. For Unix, the epoch is 1970. To find out what the epoch is, look at gmtime(0). ([Python documentation on the epoch](https://docs.python.org/3.5/library/time.html))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import time" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "time.gmtime(0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "time.time()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(type(time.asctime()))\n", + "print(time.asctime())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(type(time.ctime()))\n", + "print(time.ctime())\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "time.gmtime()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "time.localtime()" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "time.strftime" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "time.strptime" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "time.timezone" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## ``time.struct_time``\n", + "\n", + "Python defines a class ``struct_time`` that inherits from the builtin type ``tuple``. The ``struct_time`` class defines attributes needed for unambiguously describing and computing about time.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "help(time.struct_time)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating strings from ``time_struct``\n", + "\n", + "Within our programs we would probably be keeping time data in a ``time_struct`` but we at times might want to present times to users in a more human friendly form. The ``time`` module defines the [``strftime``](https://docs.python.org/3.5/library/time.html#time.strftime) function for creating a string from a ``time_struct`` instance. \n", + "\n", + "#### First create a ``time_struct`` instance for my current time" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "mytime = time.localtime()\n", + "print(time.strftime(\"%B %d, %Y\", mytime))\n", + "print(time.strftime(\"%d %b %Y\", mytime))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## [``locale``](https://docs.python.org/3.5/library/locale.html)\n", + "\n", + "As we have pointed out, styles for representing time varying across the world. There are a number of other styles that vary across the world. 
Currency is one obvious example, with $, £, and € being three common Western currency symbols. There are also differences in numeric representations. For example, in the United States we use a comma (\",\") as a thousands separator, while in countries like France the comma is the decimal separator.\n", + "\n", + "Python provides a ``locale`` module to address these differences. This Python module sits on top of your operating system's facilities for handling locale variation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import locale \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Use tab completion to see which categories ``locale`` lets you localize (e.g. currency)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "locale.LC_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What locales are supported?\n", + "\n", + "On Linux we can run ``locale -a`` on the command line to see which locales we can work with." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!locale -a" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### To get a feel for locale and time we will render our current time in a variety of locale standards" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### German" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "locale.setlocale(locale.LC_TIME, \"de_DE.UTF-8\")\n", + "\n", + "print(time.strftime(\"%B %d, %Y\", mytime))\n", + "print(time.strftime(\"%d %b %Y\", mytime))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Spanish" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "locale.setlocale(locale.LC_TIME, \"es_ES.UTF-8\")\n", + "\n", + "print(time.strftime(\"%B %d, %Y\", mytime))\n", + "print(time.strftime(\"%d %b %Y\", mytime))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Japanese" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "locale.setlocale(locale.LC_ALL, \"ja_JP.UTF-8\")\n", + "\n", + "print(time.strftime(\"%B %d, %Y\", mytime))\n", + "print(time.strftime(\"%d %b %Y\", mytime))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Chinese" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "locale.setlocale(locale.LC_ALL, \"zh_CN.UTF-8\")\n", + "\n", + "print(time.strftime(\"%B %d, %Y\", mytime))\n", + "print(time.strftime(\"%d %b %Y\", mytime))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Russian" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "locale.setlocale(locale.LC_ALL, 
\"ru_RU.utf8\")\n", + "\n", + "print(time.strftime(\"%B %d, %Y\", mytime))\n", + "print(time.strftime(\"%d %b %Y\", mytime))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "locale.setlocale(locale.LC_ALL, \"el_GR.UTF-8\")\n", + "\n", + "print(time.strftime(\"%B %d, %Y\", mytime))\n", + "print(time.strftime(\"%d %b %Y\", mytime))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set the locale back to your default locale" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "locale.setlocale(locale.LC_ALL, \"\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "time.strftime(\"%x %X \", mytime)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Parsing Time Strings\n", + "\n", + "In our data science applications it is more likely that we will need to take a string in some (arbitrary) format and parse it into a ``struct_time``. 
This is achieved with the [``strptime`` function](https://docs.python.org/3.5/library/time.html#time.strptime) that is essentially the inverse of ``strftime``.\n", + "\n", + "## Exercise\n", + "\n", + "#### Write code to parse into ``struct_time`` instances the following dates and times:\n", + "\n", + "* \"January 27, 2016\"\n", + "* \"2015 Feb 1\"\n", + "* \"12/04/15\"\n", + "* \"24/05/1968\"\n", + "* \"07/27/2016 23:07:45\"\n", + "* \"Mar 17, 2014 11:17 PM\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +}
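The "Parsing Time Strings" cell in ``time_in_python.ipynb`` describes ``strptime`` as essentially the inverse of ``strftime``. A minimal sketch of that round trip follows. The timestamp string and the format are assumptions chosen for illustration (deliberately not one of the exercise strings), and only numeric directives are used so that the result should not depend on the active locale.

```python
import time

# Hypothetical input string, chosen for illustration only.
stamp = "2016-07-27 23:07:45"

# Numeric directives only, so parsing does not depend on locale-specific month names.
fmt = "%Y-%m-%d %H:%M:%S"

# String -> struct_time
parsed = time.strptime(stamp, fmt)
print(type(parsed))
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
print(parsed.tm_hour, parsed.tm_min, parsed.tm_sec)

# struct_time -> string, using the same format, recovers the input.
round_trip = time.strftime(fmt, parsed)
print(round_trip == stamp)
```

The same pattern applies to the exercise strings: pick the directives that match each field (for example ``%B`` for a full month name or ``%p`` for AM/PM) and pass the resulting format to ``time.strptime``. Directives such as ``%B`` and ``%b`` follow the current ``LC_TIME`` setting, so month names are matched in the active locale.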