ONA 2023, August 2023
I added links, but THERE WILL BE SO MUCH MORE HERE LATER!!!!
Hi, I'm Jonathan Soma! I run the Data Journalism MS and Lede Program at Columbia, where I am Knight Chair in Data Journalism.
Contact me at js4571@columbia.edu or on Twitter.
Talk content:
Some of my websites:
- aifaq.wtf - collection of weird AI stuff
- normalai.org - many examples like what we did in the session
- https://investigate.ai/ - more traditional ML
- A few blog posts
- How ChatGPT turned generative AI into an “anything tool” (Ars Technica)
- Hugging Face models page
- Prompt Engineering for Developers course
- Apple says its App Store is ‘a safe and trusted place.’ We found 1,500 reports of unwanted sexual behavior on six apps, some targeting minors. (WATCH THE VIDEO!!! It has spreadsheets!!!)
- Predicting reports of bullying, racism, and unwanted sexual behavior from app store reviews - an old-school, traditional machine learning approach to the Washington Post app reviews classification problem
- Stemming and lemmatization - how to deal with words with different endings (fishes/fishing/fished, running/runs/ran) in traditional machine learning
- creepy-wapo - a Hugging Face space where you can demo my fine-tuned creepy-wapo model (warning: it's pretty awful at classifying things)
- Hugging Face AutoTrain - the "just upload your spreadsheet or images" way to fine-tune models
- Zero-shot classification - a tiny example from a workshop I gave in Brazil
- Named entity recognition several ways in Python
- GENIE from La Nación – gender gap source analysis tool
- Source Matters - "track and improve the diversity of sources in your news stories," from American Press Institute
Older examples of using embeddings
- Conceptual document similarity with word embeddings
- Comparing documents across languages with Universal Sentence Encoding and Tensorflow
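If you want a feel for what "conceptual" similarity means before clicking through, here is a minimal sketch. It is not the code from either post above. The three-dimensional word vectors are made up for illustration (real embeddings come from a trained model and have hundreds of dimensions), and `TOY_VECTORS`, `document_vector`, and `cosine_similarity` are hypothetical names. The idea is to average the word vectors in each document, then compare the documents with cosine similarity.

```python
import math

# Toy 3-dimensional "word vectors", invented for illustration only.
# Real embeddings come from a trained model and are much larger.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.0],
    "barked": [0.7, 0.0, 0.3],
    "yelped": [0.6, 0.1, 0.3],
    "senate": [0.0, 0.9, 0.1],
    "voted": [0.1, 0.8, 0.3],
}


def document_vector(text):
    """Average the vectors of every known word in the document."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in document")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# These two sentences share no embedded words, but their words point in similar directions.
similar = cosine_similarity(
    document_vector("The dog barked"), document_vector("A puppy yelped")
)
# This pair is about unrelated topics, so the score should be much lower.
different = cosine_similarity(
    document_vector("The dog barked"), document_vector("The senate voted")
)

print(f"dog/puppy: {similar:.2f}, dog/senate: {different:.2f}")
```

The first pair scores high even though the sentences have no words in common, which is the whole appeal of embeddings over plain keyword matching.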
More modern approaches to embeddings (using LangChain mostly)
- Multi-language document Q&A - a more modern usage of text embeddings
- How I convinced GPT to teach me about Hungarian folktales (without speaking a word of Hungarian) - a talk on text embeddings/multi-language document Q&A/document search from MediaParty Chicago 2023
- Semantra, a "multi-tool for semantic search"
- "Chat with your data" online course - there are so many of these jeez
- Leprosy of the land - an investigation into illegal amber mines in Ukraine, powered by machine learning
- Object detection (finding "things"), closely related to instance segmentation - live example
- Semantic segmentation (finding "stuff") to identify vegetation - live example
- Panoptic segmentation
- Hugging Face AutoTrain
- Prodi.gy - a product from Explosion, the same company as spaCy
- Roboflow - an online tool to do fine-tuning of image/video models
- How to hide faces and scrub metadata when you photograph a protest
- How The New York Times Uses Software To Recognize Members of Congress
- The AIJO Project - gender detection of faces on news segments
- BBB 23: Inteligência artificial revela quem mais apareceu nos VTs do programa - using facial recognition to analyze screen time of members of Big Brother Brasil
- Runway ML - an example of generative video
- Whisper - a great model for transcription (this is easier than trying to use it with Hugging Face tools, too) - live example
- Speaker diarization with pyannote
- SpeechX from Microsoft - the demos are INSANE
- Build dashboards/interactives/playgrounds with Gradio or Streamlit
- Document Summaries in Danish with OpenAI - a great example of testing/tracking AI use in the newsroom